Repository: ShuaiW/ml-cheatsheet
Branch: master
Commit: 136615833327
Files: 1
Total size: 13.9 KB

Directory structure:
gitextract_c__z9ki5/

└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# ML Cheatsheet

Assuming we have the dataset in a loadable format (e.g., csv), here are the steps
we follow to complete a machine learning project.

0. [Exploratory data analysis](#exploratory-data-analysis)
0. [Preprocessing](#preprocessing)
0. [Feature engineering](#feature-engineering)
0. [Machine learning](#machine-learning)

A couple of notes before we go on.

First of all, machine learning is a highly iterative field. In practice we loop
over the above steps, and each cycle builds on the feedback from the previous one,
with the goal of improving model performance. For example, we need to refit models
after engineering new features, and test whether these features are predictive.

Second, while in Kaggle competitions one can create a monster ensemble of models, in
production systems such ensembles are often impractical. They are
high maintenance, hard to interpret, and too complex to deploy. This is why
in practice a simpler model plus a huge amount of data often wins.

Third, while some code snippets are reusable, each dataset is unique.
Dataset-specific efforts are needed to build better models.

Bearing these points in mind, let's get our hands dirty.


## Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics,
often with plots. The goal of EDA is to get a deeper understanding of the
dataset, so that we can preprocess data and engineer features more effectively. Here
are some generic code snippets that can be applied to any structured dataset.


import libraries
```python
import os
import fnmatch
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

data I/O
```python
df = pd.read_csv(file_path) # read in csv file as a DataFrame
df.to_csv(file_path, index=False) # save a DataFrame as csv file

# read all csv under a folder and concatenate them into a big dataframe
path = r'path'

# flat
all_files = glob.glob(os.path.join(path, "*.csv"))

# or recursively
all_files = [os.path.join(root, filename)
             for root, dirnames, filenames in os.walk(path)
             for filename in fnmatch.filter(filenames, '*.csv')]
             
df = pd.concat((pd.read_csv(f) for f in all_files))
```

data I/O zipped

```python
import pandas as pd
import zipfile

zf_path = 'file.zip'
zf = zipfile.ZipFile(zf_path) # zipfile.ZipFile object
all_files = zf.namelist() # list all zipped files
all_files = [f for f in all_files if f.endswith('.csv')] # e.g., get only csv
df = pd.concat((pd.read_csv(zf.open(f)) for f in all_files)) # concat all zipped csv into one dataframe
```

save to a table in a sqlite3 DB (then you can use [DB Browser for SQLite](http://sqlitebrowser.org/) to view and query the table)

```python
import sqlite3
import pandas as pd

df = pd.read_csv(csv_file) # read csv file
sqlite_file = 'my_db.sqlite3'
conn = sqlite3.connect(sqlite_file) # establish a sqlite3 connection

# if the table already exists, append the rows to it
df.to_sql(tablename, conn, if_exists='append', index=False)
```
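To check that the rows landed in the table, we can query them back with `pd.read_sql_query`. Here is a minimal sketch, assuming a toy dataframe and an in-memory database; the table name `scores` and its columns are made up for illustration.

```python
import sqlite3
import pandas as pd

# toy data standing in for the csv content (hypothetical columns)
toy = pd.DataFrame({'id': [1, 2, 3], 'score': [0.5, 0.7, 0.9]})

conn = sqlite3.connect(':memory:')  # in-memory DB, nothing is written to disk
toy.to_sql('scores', conn, if_exists='append', index=False)

# read back only the rows we care about
result = pd.read_sql_query('SELECT id, score FROM scores WHERE score > 0.6', conn)
conn.close()
print(result)
```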


data summary
```python
df.head() # return the first 5 rows
df.describe() # summary statistics, excluding NaN values
df.info(verbose=True, show_counts=True) # concise summary of the table (pandas < 1.2: null_counts=True)
df.shape # shape of dataset
df.skew(numeric_only=True) # skewness for numeric columns
df.kurt(numeric_only=True) # unbiased kurtosis for numeric columns
df.dtypes.value_counts() # counts of dtypes
```

display missing value proportion for each col
```python
for c in df.columns:
  num_na = df[c].isnull().sum()
  if num_na > 0:
    print(round(num_na / float(len(df)), 3), '|', c)
```
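The same proportions can be computed without an explicit loop, since the mean of a boolean mask is the share of `True` values. A small sketch on a toy dataframe (the column names are made up):

```python
import numpy as np
import pandas as pd

# toy frame with missing values in two of three columns
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# isnull() gives booleans, so the column mean is the share of missing values
na_ratio = toy.isnull().mean()
# keep only columns that have missing values, largest share first
na_ratio = na_ratio[na_ratio > 0].sort_values(ascending=False)
print(na_ratio.round(3))
```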


pairwise correlation of columns
```python
df.corr(numeric_only=True) # numeric_only needs pandas >= 1.5; on older versions use df.corr()
```


### plotting

plot heatmap of correlation matrix (of all numeric columns)
```python
cm = np.corrcoef(df.T)
sns.heatmap(cm, annot=True, yticklabels=df.columns, xticklabels=df.columns)
```

![heat-corr](/assets/heat-corr.png)

plot univariate distributions
```python
# single column
sns.histplot(df['col1'].dropna(), kde=True) # distplot is deprecated in recent seaborn

# all numeric columns
for c in df.columns:
  if df[c].dtype in ['int64', 'float64']:
    sns.histplot(df[c].dropna())
    plt.show()
```

![hist](/assets/hist.png)

plot kernel density estimation (KDE)
```python
# all continuous variables
for c in df.columns:
  if df[c].dtype in ['float64']:
    sns.kdeplot(df[c].dropna(), fill=True) # older seaborn: shade=True
    plt.show()
```

![kde](/assets/kde.png)

plot pairwise relationships
```python
sns.pairplot(df.dropna())
```

![pairwise](/assets/pairwise.png)

**[hypertools](https://github.com/ContextLab/hypertools) is a python toolbox for
visualizing and manipulating high-dimensional data. This is desirable for the EDA
phase.**

visually explore relationship between features and target (in 3D space)
```python
import hypertools as hyp
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
hyp.plot(X,'o', group=y, legend=list(set(y)), normalize='across')
```

![hypertools](/assets/hypertools.png)


linear regression analysis using each PC
```python
from sklearn import linear_model
sns.set(style="darkgrid")
sns.set_palette(palette='Set2')

data = pd.DataFrame(data=X, columns=iris.feature_names)
reduced_data = hyp.reduce(hyp.tools.df2mat(data), ndims=3)

linreg = linear_model.LinearRegression()
linreg.fit(reduced_data, y)

sns.regplot(x=reduced_data[:,0],y=linreg.predict(reduced_data), label='PC1',x_bins=10)
sns.regplot(x=reduced_data[:,1],y=linreg.predict(reduced_data), label='PC2',x_bins=10)
sns.regplot(x=reduced_data[:,2],y=linreg.predict(reduced_data), label='PC3',x_bins=10)

plt.title('Correlation between PC and Regression Output')
plt.xlabel('PC Value')
plt.ylabel('Regression Output')
plt.legend()
plt.show()
```

![lg-pc](/assets/lg-pc.png)

break down by labels
```python
sns.set(style="darkgrid")
sns.swarmplot(x=y, y=reduced_data[:, 0], order=[0, 1, 2])
plt.title('Correlation between PC1 and target')
plt.xlabel('Target')
plt.ylabel('PC1 Value')
plt.show()
```

![by-label](/assets/by-label.png)

For more use cases of hypertools, check [notebooks](https://github.com/ContextLab/hypertools-paper-notebooks)
 and [examples](http://hypertools.readthedocs.io/en/latest/auto_examples/index.html)


## Preprocessing

drop columns
```python
df.drop([col1, col2, ...], axis=1, inplace=True) # in place
new_df = df.drop([col1, col2, ...], axis=1) # create new df (overhead created)
```

handle missing values
```python
# fill with mode, mean, or median
df_mode, df_mean, df_median = df.mode().iloc[0], df.mean(numeric_only=True), df.median(numeric_only=True)

df_fill_mode = df.fillna(df_mode)
df_fill_mean = df.fillna(df_mean)
df_fill_median = df.fillna(df_median)

# drop col with any missing values
df_drop_na_col = df.dropna(axis=1)
```

encode categorical features
```python
from sklearn.preprocessing import LabelEncoder

df_col = df.columns
col_non_num = [c for c in df_col if df[c].dtype == 'object']
for c in col_non_num:
  df[c] = LabelEncoder().fit_transform(df[c])
```
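`LabelEncoder` assigns integer codes in the sorted order of the labels, and the loop above discards the fitted encoders. If the original labels are needed later, one option is to keep one encoder per column. A sketch on a toy dataframe (the `encoders` dict and the column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'size': ['S', 'M', 'L'],
    'price': [1.0, 2.0, 3.0],
})

encoders = {}  # column name -> fitted LabelEncoder
for c in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[c]):
        encoders[c] = LabelEncoder()
        toy[c] = encoders[c].fit_transform(toy[c])

# map the integer codes back to the original labels
original_colors = encoders['color'].inverse_transform(toy['color'])
print(toy)
print(list(original_colors))
```

Note that the integer codes imply an ordering that may not exist in the data, which is one reason to consider one-hot encoding for linear models.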


join two tables/dataframes
```python
df1.join(df2, on=col) # matches df1[col] against the index of df2
```
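`DataFrame.join` matches the `on` column of the left frame against the index of the right frame, while `DataFrame.merge` can match on regular columns of both frames. A sketch with two hypothetical tables (`orders` and `users`) that yields the same rows either way:

```python
import pandas as pd

# hypothetical tables: orders reference users through 'user_id'
orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [10, 20, 30]})
users = pd.DataFrame({'user_id': [10, 20], 'name': ['ann', 'bob']})

# join: the right-hand frame must be indexed by the key
joined = orders.join(users.set_index('user_id'), on='user_id')

# merge: the key is a regular column in both frames
merged = orders.merge(users, on='user_id', how='left')

print(joined)
print(merged)
```

Both are left joins here, so the order with no matching user keeps its row and gets a missing `name`.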

handle outliers (outliers can either be clipped or removed).
**WARNING**: outliers are not always meant to be removed

In the following examples we assume df is all numeric and has no missing values

clipping
```python
# clip outliers to 3 standard deviations from the mean
lower = df.mean() - df.std()*3
upper = df.mean() + df.std()*3
clipped_df = df.clip(lower, upper, axis=1)
```

removal
```python
from scipy import stats

# remove rows that have outliers in at least one column
new_df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
```

filter
```python
# filter by one value
new_df = df[df.col==val]

# filter by multiple values
new_df = df[df.col.isin(val_list)]
```

## Feature engineering

### Transformation

one-hot encode categorical features; not necessary for tree-based algorithms
```python
# for a couple of columns
one_hot_df = pd.get_dummies(df[[col1, col2, ...]])

# for the whole dataframe
new_df = pd.get_dummies(df)
```

normalize numeric features (to range [0, 1])
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_arr = scaler.fit_transform(df) # returns a numpy array
normalized_df = pd.DataFrame(normalized_arr, columns=df.columns, index=df.index)
```
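When the data is later split into train/dev/test sets, a common practice is to fit the scaler on the training set only and reuse it on the other sets, so that no information from held-out data leaks into preprocessing. A sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up feature matrices: 2 columns, train and dev rows
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_dev = np.array([[2.5, 15.0], [12.0, 40.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from train only
X_dev_scaled = scaler.transform(X_dev)          # reuses the training min/max

print(X_train_scaled)
print(X_dev_scaled)
```

Values in the dev set that lie outside the training range end up outside [0, 1], which is expected with this approach.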

log transformation: for columns with highly skewed distribution, we can apply
the log transformation
```python
from scipy.special import log1p
transformed_col = df[col].apply(log1p)
```

![log](/assets/log.png)
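To see the effect numerically, we can compare the skewness before and after the transformation. A sketch on synthetic log-normal data (the data is generated only for illustration; `np.log1p` is used here in place of `scipy.special.log1p`):

```python
import numpy as np
import pandas as pd

# synthetic right-skewed column
rng = np.random.RandomState(0)
col = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10000))

skew_before = col.skew()
skew_after = np.log1p(col).skew()  # log(1 + x), safe for zeros

print('skewness before: {:.2f}'.format(skew_before))
print('skewness after:  {:.2f}'.format(skew_after))
```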


### Creation

Feature creation takes both domain knowledge and engineering effort. With help from
domain experts, we can craft more predictive features, but here are some generic
feature creation methods worth trying on any structured dataset.

add feature: number of missing values
```python
df['num_null'] = df.isnull().sum(axis=1)
```

add feature: number of zeros
```python
df['num_zero'] = (df == 0).sum(axis=1)
```

add feature: binary value for each feature indicating whether a data point is null
```python
for c in df:
  if pd.isnull(df[c]).any():
    df[c+'-ISNULL'] = pd.isnull(df[c])
```

add feature interactions
```python
from sklearn.preprocessing import PolynomialFeatures

# e.g., 2nd order interaction
poly = PolynomialFeatures(degree=2)
# numpy array of transformed df
arr = poly.fit_transform(df)
# all feature names (the bias column gets an empty name);
# scikit-learn >= 1.0 also offers poly.get_feature_names_out(df.columns)
target_feature_names = ['x'.join(
    ['{}^{}'.format(name, power) for name, power in pairs if power != 0])
    for pairs in [zip(df.columns, p) for p in poly.powers_]]
new_df = pd.DataFrame(arr, columns=target_feature_names)
```

### Selection

There are [various ways](http://scikit-learn.org/stable/modules/feature_selection.html)
to select features, and an effective one is recursive feature elimination (RFE).

select feature using RFE

```python
from sklearn.feature_selection import RFE

model = ... # a scikit-learn estimator that exposes either 'coef_' or 'feature_importances_'
num_feature = 10 # say we want the top 10 features

selector = RFE(model, n_features_to_select=num_feature, step=1)
selector.fit(X_train, y_train) # select features
feature_selected = list(X_train.columns[selector.support_])

model.fit(X_train[feature_selected], y_train) # re-train a model using only selected features
```
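For a version that runs end to end, here is a sketch on synthetic data. The dataset from `make_classification`, the column names `f0`..`f7`, and the choice of `LogisticRegression` as the base model are all assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic data: 8 features, 3 of them informative
X_arr, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(8)])

# drop one feature per iteration until 3 are left
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = list(X.columns[selector.support_])
ranking = dict(zip(X.columns, selector.ranking_))  # 1 = selected, larger = dropped earlier

print(selected)
print(ranking)
```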

For more feature engineering methods please refer to this
[blogpost](https://www.linkedin.com/pulse/feature-engineering-data-scientists-secret-sauce-ashish-kumar).


## Machine learning

### Cross validation (CV) strategy

Theory first (some adapted from Andrew Ng). In machine learning we usually have the following subsets of data:

* **training set** is used to run the learning algorithm on
* **dev set** (or hold out cross validation set) is used to tune parameters,
select features, and make other decisions regarding the learning algorithm
* **test set** is used to evaluate the performance of the algorithms,
but NOT to make any decisions about what algorithms or parameters to use

Ideally, those 3 sets should come from the same distribution, and
reflect what data you expect to get in the future and want to do well on.

If we have a real-world application from which we continuously collect new data,
then we can train on historical data, and split the incoming data into dev and
test sets. This is out of the scope of this cheatsheet. The following example
assumes we have a csv file and we want to train the best model on this snapshot.

How should we split the three sets? Here is one good CV strategy:

* training set: the larger the merrier, of course :)
* dev set should be large enough to detect differences between algorithms
(e.g., if classifier A has 90% accuracy and classifier B has 90.1%, then a dev
set of 100 examples would not be able to detect this 0.1% difference;
something around 1,000 to 10,000 examples will do)
* test set should be large enough to give high confidence in the overall
performance of the system (do not naively use 30% of the data)
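
The dev set example above comes down to simple arithmetic: with `n` examples, accuracy can only change in steps of `1/n`, so a difference smaller than that cannot even be represented. A back-of-the-envelope sketch (the helper names are made up for illustration):

```python
def accuracy_resolution(n_examples):
    """Smallest possible change in accuracy: one more (or fewer) correct example."""
    return 1.0 / n_examples


def can_resolve(diff, n_examples):
    """True if a difference in accuracy is at least one example wide."""
    return diff >= accuracy_resolution(n_examples)


for n in (100, 1000, 10000):
    print('{:>6} examples: step of {:.4%}, resolves 0.1%: {}'.format(
        n, accuracy_resolution(n), can_resolve(0.001, n)))
```

This is only a necessary condition. Sampling noise means that in practice more examples are needed to be confident that such a difference is real.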

Sometimes we can be pretty data strapped (e.g., 1,000 data points), and a compromise
strategy is 70%/15%/15% for train/dev/test sets, as follows:

```python
from sklearn.model_selection import train_test_split

# set seed for reproducibility & comparability
seed = 2017

X_train, X_other, y_train, y_other = train_test_split(
  X, y, test_size=0.3, random_state=seed)
X_dev, X_test, y_dev, y_test = train_test_split(
  X_other, y_other, test_size=0.5, random_state=seed)
```

As noted we need to seed the split.

If we have a class imbalance issue, we should split the data in a stratified way
(using the label array):

```python
X_train, X_other, y_train, y_other = train_test_split(
  X, y, test_size=0.3, random_state=seed, stratify=y)
```
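The second split can be stratified in the same way. Here is a sketch of the full 70%/15%/15% split on made-up imbalanced labels, printing the size and positive rate of each set to check that the class ratio is preserved:

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 2017

# made-up imbalanced labels: 90% class 0, 10% class 1
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.3, random_state=seed, stratify=y)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_other, y_other, test_size=0.5, random_state=seed, stratify=y_other)

for name, labels in [('train', y_train), ('dev', y_dev), ('test', y_test)]:
    print(name, len(labels), round(labels.mean(), 3))
```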

### Model training

If we've got this far, training is actually the easier part. We just initialize
a classifier and train it!

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
```

### Evaluation

Having a single-number evaluation metric allows us to sort all models according
to their performance on this metric and quickly decide what is working best.
In a production system, if we have multiple (N) evaluation metrics, we can set
N-1 of the criteria as 'satisficing' metrics, i.e., we simply require that
they meet a certain value, then define the final one as the 'optimizing' metric
which we directly optimize.
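
As a sketch of this selection rule: filter candidates by the satisficing thresholds, then pick the best one on the optimizing metric. The candidate models, their numbers, and the thresholds below are all hypothetical.

```python
# hypothetical candidates with one optimizing metric (auc)
# and two satisficing metrics (latency, model size)
candidates = [
    {'name': 'A', 'auc': 0.92, 'latency_ms': 80, 'size_mb': 40},
    {'name': 'B', 'auc': 0.95, 'latency_ms': 250, 'size_mb': 30},
    {'name': 'C', 'auc': 0.90, 'latency_ms': 20, 'size_mb': 10},
]

# satisficing: these only need to meet a threshold
max_latency_ms = 100
max_size_mb = 50

feasible = [m for m in candidates
            if m['latency_ms'] <= max_latency_ms and m['size_mb'] <= max_size_mb]

# optimizing: among feasible models, take the highest AUC
best = max(feasible, key=lambda m: m['auc'])
print(best['name'])
```

Model B has the highest AUC but misses the latency threshold, so it is never considered.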

Here is an example of evaluating a model with Area Under the Curve (AUC)

```python
from sklearn.metrics import roc_auc_score

# AUC ranks examples, so pass the positive-class probability (binary case)
y_score = clf.predict_proba(X_test)[:, 1]
print('ROC AUC score: {}'.format(roc_auc_score(y_test, y_score)))
```
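AUC measures how well the model ranks positives above negatives, so it can give different values for continuous scores and for hard 0/1 predictions. A tiny sketch with made-up labels and scores, where thresholding at 0.5 throws the ranking away:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# made-up ground truth and predicted probabilities
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.45, 0.3])

# hard predictions at a 0.5 threshold: every example becomes class 0
hard_labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print('AUC from scores: {}'.format(auc_from_scores))
print('AUC from labels: {}'.format(auc_from_labels))
```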

### Hyperparameter tuning

example of nested cross-validation

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train = ... # your training features
y_train = ... # your training labels

gs = GridSearchCV(
  estimator = RandomForestClassifier(random_state=0),
  param_grid = {
    'n_estimators': [100, 200, 400, 600, 800],
     # other params to tune
     },
  scoring = 'roc_auc',
  cv = 5
)

scores = cross_val_score(
  gs,
  X_train,
  y_train,
  scoring = 'roc_auc',
  cv = 2
)

print('CV roc_auc: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```

### Ensemble

Please refer to the last section of this
[blogpost](https://shuaiw.github.io/2016/07/19/data-science-project-workflow.html).