[
  {
    "path": "README.md",
    "content": "# ML Cheatsheet\n\nAssuming we have the dataset in a loadable format (e.g., csv), here are the steps\nwe follow to complete a machine learning project.\n\n0. [Exploratory data analysis](#exploratory-data-analysis)\n0. [Preprocessing](#preprocessing)\n0. [Feature engineering](#feature-engineering)\n0. [Machine learning](#machine-learning)\n\nA couple of notes before we go on.\n\nFirst of all, machine learning is a highly iterative field. This would entail\na loop cycle of the above steps, where each cycle is based on the feedback from\nthe previous cycle, with the goal of improving the model performance. One\nexample is that we need refit models when we engineered new features, and\ntest to see if these features are predicative.\n\nSecond, while in Kaggle competitions one can create a monster ensemble of models, in\nproduction system often times such ensembles are not useful. They are\nhigh maintenance, hard to interpret, and too complex to deploy. This is why\nin practice it's often simpler model plus huge amount of data that wins.\n\nThird, while some code snippets are reusable, each dataset has its own\nuniqueness. Dataset-specific efforts are needed to build better models.\n\nBearing these points in mind, let's get our hands dirty.\n\n\n## Exploratory data analysis\n\nExploratory data analysis (EDA) is an approach to analyze data sets to summarize their main characteristics,\noften with plots. The goal of EDA is to get a deeper understanding of the\ndataset, and to preprocess data and engineer features more effectively. 
Here\nare some generic code snippets that can be applied to any structured dataset.\n\n\nimport libraries\n```python\nimport os\nimport fnmatch\nimport glob\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n```\n\ndata I/O\n```python\ndf = pd.read_csv(file_path) # read in csv file as a DataFrame\ndf.to_csv(file_path, index=False) # save a DataFrame as csv file\n\n# read all csv under a folder and concatenate them into a big dataframe\npath = r'path'\n\n# flat\nall_files = glob.glob(os.path.join(path, \"*.csv\"))\n\n# or recursively\nall_files = [os.path.join(root, filename)\n             for root, dirnames, filenames in os.walk(path)\n             for filename in fnmatch.filter(filenames, '*.csv')]\n\n# ignore_index=True gives the combined dataframe a fresh 0..n-1 index\ndf = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)\n```\n\ndata I/O zipped\n\n```python\nimport pandas as pd\nimport zipfile\n\nzf_path = 'file.zip'\nzf = zipfile.ZipFile(zf_path) # zipfile.ZipFile object\nall_files = zf.namelist() # list all zipped files\nall_files = [f for f in all_files if f.endswith('.csv')] # e.g., get only csv\n# concat all zipped csv into one dataframe\ndf = pd.concat((pd.read_csv(zf.open(f)) for f in all_files), ignore_index=True)\n```\n\ndata I/O to a table in a sqlite3 DB (then you can use [DB Browser for SQLite](http://sqlitebrowser.org/) to view and query the table)\n\n```python\nimport sqlite3\nimport pandas as pd\n\ndf = pd.read_csv(csv_file) # read csv file\nsqlite_file = 'my_db.sqlite3'\nconn = sqlite3.connect(sqlite_file) # establish a sqlite3 connection\n\n# create the table, or append the rows if the table already exists\ndf.to_sql(tablename, conn, if_exists='append', index=False)\nconn.close() # close the connection when done\n```\n\n\ndata summary\n```python\ndf.head() # return the first 5 rows\ndf.describe() # summary statistics, excluding NaN values\ndf.info(verbose=True, show_counts=True) # concise summary of the table (show_counts replaces null_counts in pandas >= 1.2)\ndf.shape # shape of dataset\ndf.skew(numeric_only=True) # skewness for numeric columns\ndf.kurt(numeric_only=True) # unbiased kurtosis for numeric columns\ndf.dtypes.value_counts() # counts of 
dtypes\n```\n\ndisplay missing value proportion for each col\n```python\nfor c in df.columns:\n  num_na = df[c].isnull().sum()\n  if num_na > 0:\n    print(round(num_na / len(df), 3), '|', c)\n```\n\n\npairwise correlation of columns\n```python\n# restrict to numeric columns so non-numeric ones do not raise an error\ndf.select_dtypes(include='number').corr()\n```\n\n\n### plotting\n\nplot heatmap of correlation matrix (of all numeric columns)\n```python\n# DataFrame.corr ignores NaN pairwise and keeps the column labels for the axes\ncm = df.select_dtypes(include='number').corr()\nsns.heatmap(cm, annot=True)\n```\n\n![heat-corr](/assets/heat-corr.png)\n\nplot univariate distributions\n```python\n# single column (histplot replaces the deprecated distplot in seaborn >= 0.11)\nsns.histplot(df['col1'].dropna(), kde=True)\n\n# all numeric columns\nfor c in df.columns:\n  if df[c].dtype in ['int64', 'float64']:\n    sns.histplot(df[c].dropna())\n    plt.show()\n```\n\n![hist](/assets/hist.png)\n\nplot kernel density estimation (KDE)\n```python\n# all continuous variables\nfor c in df.columns:\n  if df[c].dtype in ['float64']:\n    sns.kdeplot(df[c].dropna(), fill=True) # fill replaces the deprecated shade argument\n    plt.show()\n```\n\n![kde](/assets/kde.png)\n\nplot pairwise relationships\n```python\nsns.pairplot(df.dropna())\n```\n\n![pairwise](/assets/pairwise.png)\n\n**[hypertools](https://github.com/ContextLab/hypertools) is a python toolbox for\nvisualizing and manipulating high-dimensional data. 
This makes it a good fit for the EDA\nphase.**\n\nvisually explore relationship between features and target (in 3D space)\n```python\nimport hypertools as hyp\nimport seaborn as sns\nfrom sklearn import datasets\n\niris = datasets.load_iris()\nX = iris.data\ny = iris.target\nhyp.plot(X,'o', group=y, legend=list(set(y)), normalize='across')\n```\n\n![hypertools](/assets/hypertools.png)\n\n\nlinear regression analysis using each PC\n```python\nfrom sklearn import linear_model\nsns.set(style=\"darkgrid\")\nsns.set_palette(palette='Set2')\n\ndata = pd.DataFrame(data=X, columns=iris.feature_names)\nreduced_data = hyp.reduce(hyp.tools.df2mat(data), ndims=3)\n\nlinreg = linear_model.LinearRegression()\nlinreg.fit(reduced_data, y)\ny_hat = linreg.predict(reduced_data) # predict once, reuse for every PC\n\nsns.regplot(x=reduced_data[:,0], y=y_hat, label='PC1', x_bins=10)\nsns.regplot(x=reduced_data[:,1], y=y_hat, label='PC2', x_bins=10)\nsns.regplot(x=reduced_data[:,2], y=y_hat, label='PC3', x_bins=10)\n\nplt.title('Correlation between PC and Regression Output')\nplt.xlabel('PC Value')\nplt.ylabel('Regression Output')\nplt.legend()\nplt.show()\n```\n\n![lg-pc](/assets/lg-pc.png)\n\nbreak down by labels\n```python\nsns.set(style=\"darkgrid\")\n# recent seaborn versions require x and y to be passed as keywords\nsns.swarmplot(x=y, y=reduced_data[:,0], order=[0, 1, 2])\nplt.title('Correlation between PC1 and target')\nplt.xlabel('Target')\nplt.ylabel('PC1 Value')\nplt.show()\n```\n\n![by-label](/assets/by-label.png)\n\nFor more use cases of hypertools, check the [notebooks](https://github.com/ContextLab/hypertools-paper-notebooks)\nand [examples](http://hypertools.readthedocs.io/en/latest/auto_examples/index.html).\n\n\n## Preprocessing\n\ndrop columns\n```python\ndf.drop([col1, col2, ...], axis=1, inplace=True) # in place\nnew_df = df.drop([col1, col2, ...], axis=1) # returns a new df (extra memory for the copy)\n```\n\nhandle missing values\n```python\n# fill with mode, mean, or median (mean and median only apply to numeric columns)\ndf_mode = df.mode().iloc[0]\ndf_mean, df_median = df.mean(numeric_only=True), df.median(numeric_only=True)\n\ndf_fill_mode = 
df.fillna(df_mode)\ndf_fill_mean = df.fillna(df_mean)\ndf_fill_median = df.fillna(df_median)\n\n# drop col with any missing values\ndf_drop_na_col = df.dropna(axis=1)\n```\n\nencode categorical features\n```python\nfrom sklearn.preprocessing import LabelEncoder\n\n# map each distinct value of every non-numeric column to an integer code\ncol_non_num = [c for c in df.columns if df[c].dtype == 'object']\nfor c in col_non_num:\n  df[c] = LabelEncoder().fit_transform(df[c])\n```\n\n\njoin two tables/dataframes\n```python\n# join matches df1[col] against the index of df2\ndf1.join(df2, on=col)\n\n# to join on a column that exists in both dataframes, use merge\npd.merge(df1, df2, on=col)\n```\n\nhandle outliers (outliers can either be clipped or removed.\n**WARNING**: outliers are not always meant to be removed)\n\nIn the following examples we assume df is all numeric and has no missing values.\n\nclipping\n```python\n# clip outliers to 3 standard deviations from the column mean\nlower = df.mean() - df.std()*3\nupper = df.mean() + df.std()*3\nclipped_df = df.clip(lower, upper, axis=1)\n```\n\nremoval\n```python\nimport numpy as np\nfrom scipy import stats\n\n# remove rows that have outliers in at least one column\nnew_df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]\n```\n\nfilter\n```python\n# filter by one value\nnew_df = df[df.col==val]\n\n# filter by multiple values\nnew_df = df[df.col.isin(val_list)]\n```\n\n## Feature engineering\n\n### Transformation\n\none-hot encode categorical features; not necessary for tree-based algorithms\n```python\n# for a couple of columns\none_hot_df = pd.get_dummies(df[[col1, col2, ...]])\n\n# for the whole dataframe\nnew_df = pd.get_dummies(df)\n```\n\nnormalize numeric features (to range [0, 1])\n```python\nfrom sklearn.preprocessing import MinMaxScaler\n\nscaler = MinMaxScaler()\n# fit_transform returns a numpy array, so wrap it back into a DataFrame\nnormalized_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)\n```\n\nlog transformation: for columns with a highly skewed distribution, we can apply\nthe log transformation\n```python\nimport numpy as np\n\n# log1p computes log(1 + x), so zeros are handled; values must be greater than -1\ntransformed_col = np.log1p(df[col])\n```\n\n![log](/assets/log.png)\n\n\n### Creation\n\nFeature creation takes both domain knowledge and engineering effort. 
With the help of\ndomain experts, we can craft more predictive features, but here are some generic\nfeature creation methods worth trying on any structured dataset.\n\nadd feature: number of missing values\n```python\ndf['num_null'] = df.isnull().sum(axis=1)\n```\n\nadd feature: number of zeros\n```python\ndf['num_zero'] = (df == 0).sum(axis=1)\n```\n\nadd feature: binary value for each feature indicating whether a data point is null\n```python\n# iterate over a copy of the column list because we add columns inside the loop\nfor c in list(df.columns):\n  if pd.isnull(df[c]).any():\n    df[c+'-ISNULL'] = pd.isnull(df[c])\n```\n\nadd feature interactions\n```python\nfrom sklearn.preprocessing import PolynomialFeatures\n\n# e.g., 2nd order interaction\npoly = PolynomialFeatures(degree=2)\n# numpy array of transformed df\narr = poly.fit_transform(df)\n# all feature names, e.g. 'a^1xb^1'; the bias column (all powers 0) gets an empty name\ntarget_feature_names = ['x'.join(\n    ['{}^{}'.format(col, power) for col, power in pairs if power != 0])\n    for pairs in [zip(df.columns, p) for p in poly.powers_]]\nnew_df = pd.DataFrame(arr, columns=target_feature_names)\n```\n\n### Selection\n\nThere are [various ways](http://scikit-learn.org/stable/modules/feature_selection.html)\nto select features, and an effective one is recursive feature elimination (RFE).\n\nselect features using RFE\n\n```python\nfrom sklearn.feature_selection import RFE\n\nmodel = ... # a sklearn classifier that has either a 'coef_' or 'feature_importances_' attribute\nnum_feature = 10 # say we want the top 10 features\n\nselector = RFE(model, n_features_to_select=num_feature, step=1)\nselector.fit(X_train, y_train) # select features\nfeature_selected = list(X_train.columns[selector.support_])\n\nmodel.fit(X_train[feature_selected], y_train) # re-train a model using only selected features\n```\n\nFor more feature engineering methods please refer to this\n[blogpost](https://www.linkedin.com/pulse/feature-engineering-data-scientists-secret-sauce-ashish-kumar).\n\n\n## Machine learning\n\n### Cross validation (CV) strategy\n\nTheory first (some adapted from Andrew Ng). 
In machine learning we usually have the following subsets of data:\n\n* **training set** is what we run the learning algorithm on\n* **dev set** (or hold-out cross validation set) is used to tune parameters,\nselect features, and make other decisions regarding the learning algorithm\n* **test set** is used to evaluate the performance of the algorithms,\nbut NOT to make any decisions about what algorithms or parameters to use\n\nIdeally, those 3 sets should come from the same distribution, and\nreflect the data you expect to get in the future and want to do well on.\n\nIf we have a real-world application from which we continuously collect new data,\nthen we can train on historical data, and split the incoming data into dev and\ntest sets. This is out of the scope of this cheatsheet. The following example\nassumes we have a csv file and we want to train the best model on this snapshot.\n\nHow should we split the three sets? Here is one good CV strategy:\n\n* training set: the larger the merrier of course :)\n* dev set should be large enough to detect differences between algorithms\n(e.g., if classifier A has 90% accuracy and classifier B has 90.1%, then a dev\nset of 100 examples would not be able to detect this 0.1% difference.\nSomething around 1,000 to 10,000 examples will do)\n* test set should be large enough to give high confidence in the overall\nperformance of the system (do not naively use 30% of the data)\n\nSometimes we can be pretty data strapped (e.g., 1000 data points), and a compromise\nstrategy is 70%/15%/15% for train/dev/test sets, as follows:\n\n```python\nfrom sklearn.model_selection import train_test_split\n\n# set seed for reproducibility & comparability\nseed = 2017\n\n# first split off 30% of the data, then halve that 30% into dev and test\nX_train, X_other, y_train, y_other = train_test_split(\n  X, y, test_size=0.3, random_state=seed)\nX_dev, X_test, y_dev, y_test = train_test_split(\n  X_other, y_other, test_size=0.5, random_state=seed)\n```\n\nAs noted in the comment, we seed the split so that results are reproducible and comparable across runs.\n\nIf we have a class imbalance issue, we 
should split the data in a stratified way\n(using the label array):\n\n```python\nX_train, X_other, y_train, y_other = train_test_split(\n  X, y, test_size=0.3, random_state=seed, stratify=y)\n```\n\n### Model training\n\nIf we've gotten this far, training is actually the easier part. We just initialize\na classifier and train it!\n\n```python\nfrom sklearn.linear_model import LogisticRegression\n\nclf = LogisticRegression()\nclf.fit(X_train, y_train) # fit on training features and training labels\n```\n\n### Evaluation\n\nHaving a single-number evaluation metric allows us to sort all models according\nto their performance on this metric and quickly decide what is working best.\nIn a production system, if we have multiple (N) evaluation metrics, we can set\nN-1 of the criteria as 'satisficing' metrics, i.e., we simply require that\nthey meet a certain value, then define the final one as the 'optimizing' metric\nwhich we directly optimize.\n\nHere is an example of evaluating a model with Area Under the Curve (AUC)\n\n```python\nfrom sklearn.metrics import roc_auc_score\n\n# AUC needs scores rather than hard labels; this assumes binary classification\ny_score = clf.predict_proba(X_test)[:, 1]\nprint('ROC AUC: {}'.format(roc_auc_score(y_test, y_score)))\n```\n\n### Hyperparameter tuning\n\nexample of nested cross-validation\n\n```python\nimport numpy as np\nfrom sklearn.model_selection import GridSearchCV, cross_val_score\nfrom sklearn.ensemble import RandomForestClassifier\n\nX_train = ... # your training features\ny_train = ... # your training labels\n\n# inner loop: grid search with 5-fold CV picks the hyperparameters\ngs = GridSearchCV(\n  estimator = RandomForestClassifier(random_state=0),\n  param_grid = {\n    'n_estimators': [100, 200, 400, 600, 800],\n    # other params to tune\n  },\n  scoring = 'roc_auc',\n  cv = 5\n)\n\n# outer loop: 2-fold CV around the whole grid search estimates performance\nscores = cross_val_score(\n  gs,\n  X_train,\n  y_train,\n  scoring = 'roc_auc',\n  cv = 2\n)\n\nprint('CV roc_auc: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))\n```\n\n### Ensemble\n\nPlease refer to the last section of this\n[blogpost](https://shuaiw.github.io/2016/07/19/data-science-project-workflow.html).\n"
  }
]