Repository: pshah123/ChanceyNN
Branch: master
Commit: 7c5a9d6a475c
Files: 42
Total size: 15.0 KB

Directory structure:
gitextract_ysq5u7o2/
├── .gitignore
├── README.md
├── neuralnet/
│   ├── README.MD
│   ├── main.py
│   ├── predict.py
│   └── train/
│       └── model/
│           └── carnegie_mellon/
│               ├── model.ckpt-1.data-00000-of-00001
│               ├── model.ckpt-1.index
│               └── model.ckpt-1.meta
├── requirements.txt
├── templates/
│   └── website.html
├── test.py
├── train/
│   └── model/
│       ├── carnegie_mellon/
│       │   ├── model.ckpt-143090.data-00000-of-00001
│       │   ├── model.ckpt-143090.index
│       │   ├── model.ckpt-143090.meta
│       │   ├── model.ckpt-146593.data-00000-of-00001
│       │   ├── model.ckpt-146593.index
│       │   ├── model.ckpt-146593.meta
│       │   ├── model.ckpt-150292.data-00000-of-00001
│       │   ├── model.ckpt-150292.index
│       │   ├── model.ckpt-150292.meta
│       │   ├── model.ckpt-154090.data-00000-of-00001
│       │   ├── model.ckpt-154090.index
│       │   ├── model.ckpt-154090.meta
│       │   ├── model.ckpt-156002.data-00000-of-00001
│       │   ├── model.ckpt-156002.index
│       │   └── model.ckpt-156002.meta
│       └── harvard/
│           ├── model.ckpt-232897.data-00000-of-00001
│           ├── model.ckpt-232897.index
│           ├── model.ckpt-232897.meta
│           ├── model.ckpt-237769.data-00000-of-00001
│           ├── model.ckpt-237769.index
│           ├── model.ckpt-237769.meta
│           ├── model.ckpt-242576.data-00000-of-00001
│           ├── model.ckpt-242576.index
│           ├── model.ckpt-242576.meta
│           ├── model.ckpt-247728.data-00000-of-00001
│           ├── model.ckpt-247728.index
│           ├── model.ckpt-247728.meta
│           ├── model.ckpt-250000.data-00000-of-00001
│           ├── model.ckpt-250000.index
│           └── model.ckpt-250000.meta
└── website.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.DS_STORE
*.git
.git
*.pyc
*.tfevents.*
*/__pycache__/*
*.csv
checkpoint
*.pb
*.pbtxt

================================================
FILE: README.md
================================================
# Chancey, college admissions predictor.

[ ![Codeship Status for pshah123/ChanceyNN](https://app.codeship.com/projects/920e3190-9232-0135-1d57-766c0c9c4a48/status?branch=master)](https://app.codeship.com/projects/250609)

Chancey is a predictor for college admissions based on GPA and SAT2400 data. Surprisingly enough, despite claims of a `holistic` approach, most colleges easily reach ~80% accuracy on this model with ~50 samples of data.

## Reqs

- Python (prefer 3.x)
- Tensorflow (prefer newest; recommend GPU or high-powered CPU)
- `console-logging` Python module, for more beautiful logs; get it from `pip`
- `numpy`; I highly recommend using an Anaconda distribution of Python 3
- `flask`; get it from `pip`

## How it works

This is probably the simplest neural network you'll see today. I simply implemented the DNN Classifier, but instead of using a traditional approach with hundreds of nodes, I experimented with the parameters and brought the hidden layers down to 10-20-10. The implementation is extremely simple and straightforward, as both of our inputs are plain numbers. After training on a corpus of GPA+SAT data, it can predict admissions.

## Training

See the README file in the `neuralnet` folder. You will need to call `main.py` from this directory, e.g. `python neuralnet/main.py .. args ..`.

Assemble a dataset CSV file, then cut 1/3 of its contents into another CSV file; this new file is your test dataset.

**Important: if you want the raw accuracy, set both training and testing to the same CSV and train for one step. Otherwise it always spits out 0.5 -- this is not the real accuracy; it is an artifact of there being exactly 11 acceptances and 11 rejections in the test dataset. For you it might spit out a different number: the ratio of acceptances to rejections in your test dataset. In that case, just train to 150k steps or to a loss around 0.7 or below.**

![Console](images/cmd.PNG)

I have provided the CMU dataset I originally gathered by hand to train this network.
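The split described above ("cut 1/3 of the contents into another CSV file") can be sketched as below. `split_dataset` and the toy rows are illustrative only, not part of this repo; in practice you would read and write headerless CSV rows, as `main.py` expects.

```python
import random

def split_dataset(rows, test_fraction=1/3, seed=0):
    """Shuffle the dataset rows and split off a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for a reproducible split
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]  # (train, test)

# Example with toy (gpa, test_score, label) rows:
rows = [(3.0 + 0.01 * i, 1500 + 10 * i, i % 2) for i in range(30)]
train, test = split_dataset(rows)
print(len(train), len(test))  # 20 10
```

Write `train` to your dataset CSV and `test` to your test CSV (no headers), then pass both paths to `neuralnet/main.py`.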
More information on naming datasets is in the README file.

Quick stats: GeForce 1060, 6 GB; ~4 minutes for 150k steps and ~78.5% accuracy.

Graph of loss over 150k steps:

![Loss](images/loss.PNG)

Graph of accuracy over 150k steps:

![Accuracy](images/loss.PNG)

## Predictions

`python website.py`; you'll need Flask.

![Form](images/form.png)

## FAQS

**Does this mean colleges don't care about me as a person for the most part?**

Perhaps, perhaps not. As my wonderful stat teacher pointed out to me, GPA/SAT are not independent of you as a person. It is likely that many individuals in the dataset had GPA/SAT scores corresponding to their extracurricular activities + essay quality. So no, this does not definitively prove that colleges ignore you as a person. Rather, it suggests that GPA/SAT are powerful metrics that can be used to filter applicants.

**Won't this just scare me away from college apps? How can you be sure this works?**

I'm not sure. That's why the predictor uses language like `likely` and `unlikely`. This isn't perfect, and college admissions are often random and influenced by external factors I can't predict. Don't let this dissuade you from applying to a college. Rather, simply use this to filter through schools if you're like me and had trouble narrowing down your list from 20+.

**Isn't this network way too simple? Shouldn't you add an LSTM layer or RNN capabilities?**

It may be simple, but in this case it *works*. In later revisions I have implemented LSTM cells, but they yielded only minor improvements, and I am not at liberty to open-source those parts of the project. Rest assured that these are _minor_ improvements at best, at least in my experience.

If you have any ideas to make this more accurate, feel free to contribute! This repo is open to all.

================================================
FILE: neuralnet/README.MD
================================================
This repository is no longer maintained.
It has been updated and cleaned to the last declassified/safe-to-release version, which is from September 12, 2016. Due to a switch to proprietary information, we are no longer updating the corpus, model, or python scripts.

Usage:

```python neuralnet/main.py path/to/dataset.csv path/to/test_dataset.csv #maxgpa #maxtestscore```

Run from the main directory. To restore checkpoints and train on one model, keep the dataset filename the same. To use a new model, use a newly named dataset or delete the model from its directory under train/model/.

Note: with the Carnegie Mellon corpus and provided checkpoint (150,000 steps), relatively high accuracy can be achieved upon further training.

Training stats: Trained on GeForce 1060, 6 GB. ~4 minutes for 150,000 steps @ 78.5% accuracy on a relatively small dataset. No overfitting observed. Cross-validation dataset and cross-validation scripts are not provided due to licensing issues. :sadparrot:

*Interestingly, it would appear that even if overfitting occurred, it may not affect our true accuracy, as college admissions behave as if overfitted... but this is a conjecture to be tested at a later point.*

The `train` directory is at the root level. The `train` directory in this folder is used simply for testing.

================================================
FILE: neuralnet/main.py
================================================
from __future__ import absolute_import, division, print_function
import tensorflow as tf
import numpy as np
import os
import sys
from console_logging.console import Console
from sys import argv

usage = "\nUsage:\npython neuralnet/main.py path/to/dataset.csv path/to/crossvalidation_dataset.csv #MAX_GPA #MAX_TEST_SCORE\n\nExample:\tpython neuralnet/main.py harvard.csv harvard_test.csv 6.0 2400\n\nThe dataset should have one column of GPA and one column of applicable test scores, no headers."
console = Console()
console.setVerbosity(3)  # only logs success and error
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

try:
    script, dataset_filename, test_filename, maxgpa, maxtest = argv
except ValueError:
    console.error(str(sys.exc_info()[0]))
    print(usage)
    exit(1)

dataset_filename = str(dataset_filename)
maxgpa = float(maxgpa)
maxtest = int(maxtest)

if dataset_filename[-4:] != ".csv":
    console.error("Filetype not recognized as CSV.")
    print(usage)
    exit(1)

# Data sets
DATA_TRAINING = dataset_filename
DATA_TEST = test_filename

'''
We are expecting features that are floats (gpa, sat, act)
and outcomes that are integers (0 for reject, 1 for accept)
'''

##
# Load datasets using tf contrib libraries
training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=DATA_TRAINING, target_dtype=np.int, features_dtype=np.float)
test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=DATA_TEST, target_dtype=np.int, features_dtype=np.float)

##
# First two columns are gpa, sat/act, which are our features
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=2)]

# Build a neural network with 3 hidden layers. We're putting the model into /train/model/
# I found 3 hidden layers with 10, 20, and 10 nodes respectively work well. You may find other setups.
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3,
    model_dir="./train/model/" + dataset_filename[dataset_filename.rfind('/') + 1:-4],
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=10))

##
# Helper functions
def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y

def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)
    return x, y

##
print("How many steps should we train for?")
maxsteps = int(input('> '))

# Train the classifier. Take maxsteps steps.
classifier.fit(input_fn=get_train_inputs, steps=maxsteps)

# Evaluate loss.
results = classifier.evaluate(input_fn=get_test_inputs, steps=1)
print(results)
console.success('\nFinished with loss {0:f}'.format(results['loss']))

print("\nPlease provide a GPA and test score to chance.")
cur_gpa = float(input('GPA: '))
print("Given " + str(cur_gpa))
test_score = int(input('Test Score: '))

def new_samples():
    return np.array(
        [[0.0, 0], [cur_gpa, test_score], [maxgpa, maxtest]], dtype=np.float32)

predictions = list(classifier.predict(input_fn=new_samples))
console.success("Made predictions:")

def returnChance(chance):
    if chance == 0:
        return "rejection"
    if chance == 1:
        return "admission"
    return "unknown"

console.log("Testing:\nGPA: 0\nTest Score: 0\nPrediction: %s\nExpected: rejection" % returnChance(predictions[0]))
console.log("Testing:\nGPA: %0.1f\nTest Score: %d\nPrediction: %s\nExpected: admission" % (maxgpa, maxtest, returnChance(predictions[2])))
console.success("Predicting:\nGPA: %0.1f\nTest Score: %d\nPrediction: %s" % (cur_gpa, test_score, returnChance(predictions[1])))

================================================
FILE: neuralnet/predict.py
================================================
from __future__ import absolute_import, division, print_function
import tensorflow as tf
import numpy as np
import os
from console_logging.console import Console

console = Console()
usage = "You shouldn't be running this file."
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
console.setVerbosity(3)  # only error, success, log

script = 'predict.py'
dataset_filename = './neuralnet/corpus/carnegie_mellon.csv'
maxgpa = 5.0
maxtest = 2400

dataset_filename = str(dataset_filename)
maxgpa = float(maxgpa)
maxtest = int(maxtest)

if dataset_filename[-4:] != ".csv":
    console.error("Filetype not recognized as CSV.")
    print(usage)
    exit(1)

# Data sets
DATA_TRAINING = dataset_filename
DATA_TEST = dataset_filename

'''
We are expecting features that are floats (gpa, sat, act)
and outcomes that are integers (0 for reject, 1 for accept)
'''

# Load datasets using tf contrib libraries
training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=DATA_TRAINING, target_dtype=np.int, features_dtype=np.float)
test_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=DATA_TEST, target_dtype=np.int, features_dtype=np.float)

##
# First two columns are gpa, sat/act, which are our features
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=2)]

##
# Build a neural network with the same 3 hidden layers and model_dir as main.py,
# so the trained checkpoints are restored.
classifier = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3,
    model_dir="./train/model/" + dataset_filename[dataset_filename.rfind('/') + 1:-4],
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=60))

##
# Helper functions
def get_train_inputs():
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y

def get_test_inputs():
    x = tf.constant(test_set.data)
    y = tf.constant(test_set.target)
    return x, y

##
maxsteps = 1

# Fit the classifier. Take just one step; we're restoring, not training.
classifier.fit(input_fn=get_train_inputs, steps=maxsteps)

def predict(cur_gpa, testscore, test_type):
    # TODO: implement test_type
    gpa_in = cur_gpa
    testscore_in = testscore

    def new_samples():
        return np.array([[gpa_in, testscore_in]], dtype=np.float32)

    predictions = list(classifier.predict(input_fn=new_samples))
    return predictions

================================================
FILE: requirements.txt
================================================
tensorflow
console-logging
numpy
flask

================================================
FILE: templates/website.html
================================================
<title>College Predictor</title>
<h1>Chance yourself at {{ college['name'] }}! (accuracy: {{ college['accuracy'] }})</h1>
<!-- minimal form (reconstructed, markup illustrative): field names match website.py's request.form keys -->
<form method="POST" action="/predict">GPA: <input name="gpa"> Test Score: <input name="test_score"> <input type="submit" value="Chance me"></form>
================================================ FILE: test.py ================================================ from flask import Flask, render_template, request import neuralnet.predict as pr from console_logging.console import Console console = Console() app=Flask(__name__) @app.route('/predict', methods=['POST']) def predict(): # get form variables and type them gpa = float(request.form["gpa"]) score = int(request.form["test_score"]) console.info("Chancing GPA: %d, SAT: %d"%(gpa,score)) predictions=[] #TODO: implement test type. This is a stub. if score<=36: predictions=pr.predict(gpa,score,"ACT") elif score<=1600: predictions=pr.predict(gpa,score,"SAT1600") else: predictions = pr.predict(gpa,score,"SAT2400") ## if predictions[0]==1: return "Admission is likely." else: if predictions[0]==0: return "Admission is unlikely." return "Something went wrong." @app.route('/') def home(): return render_template('website.html', college={'name':'CMU','accuracy':'78.6517'}) ================================================ FILE: website.py ================================================ from flask import Flask, render_template, request import neuralnet.predict as pr from console_logging.console import Console console = Console() app=Flask(__name__) @app.route('/predict', methods=['POST']) def predict(): # get form variables and type them gpa = float(request.form["gpa"]) score = int(request.form["test_score"]) console.info("Chancing GPA: %d, SAT: %d"%(gpa,score)) predictions=[] #TODO: implement test type. This is a stub. if score<=36: predictions=pr.predict(gpa,score,"ACT") elif score<=1600: predictions=pr.predict(gpa,score,"SAT1600") else: predictions = pr.predict(gpa,score,"SAT2400") ## if predictions[0]==1: return "Admission is likely." else: if predictions[0]==0: return "Admission is unlikely." return "Something went wrong." 
@app.route('/')
def home():
    return render_template('website.html', college={'name': 'CMU', 'accuracy': '78.6517'})

if __name__ == '__main__':
    app.run(host='0.0.0.0')
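The score-based routing that `website.py` and `test.py` share (ACT if the score is at most 36, old SAT if at most 1600, new SAT otherwise) can be isolated as a pure helper for unit testing. `guess_test_type` is a hypothetical name, not part of this repo:

```python
def guess_test_type(score):
    """Mirror the routing in website.py: ACT <= 36, SAT1600 <= 1600, else SAT2400."""
    if score <= 36:
        return "ACT"
    elif score <= 1600:
        return "SAT1600"
    return "SAT2400"

print(guess_test_type(34), guess_test_type(1450), guess_test_type(2250))
# ACT SAT1600 SAT2400
```

Note the ambiguity this inherits from the original: a new-SAT score of 1600 or below is indistinguishable from an old-SAT score, which is why the test-type parameter remains a TODO in `predict.py`.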