[
  {
    "path": "Jupyter_nnmodel/README.md",
    "content": "# Content: Jupyter Version of 2nd place code kaggle-porto-seguro\n## [Porto Seguro’s Safe Driver Prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) in Kaggle Competition\n\n### Popose\nThe popose of this jupyter script is to make flow of code readable and easy to understand. Some code have been changed from the orignial author, but the concept of processing the 2nd place code is exactly same. For the some feature engineering processing, I put these codes into `feature_generater.py`\n\n### Install\n\nThis project requires **Python 2.7 or 3.6** and the following Python libraries installed:\n\n- [NumPy](http://www.numpy.org/)\n- [Pandas](http://pandas.pydata.org)\n- [Keras](http://matplotlib.org/)\n- [scikit-learn](http://scikit-learn.org/stable/)\n- [xgboost](https://xgboost.readthedocs.io/)\n- [pickle](https://www.tensorflow.org/)\n- [keras](https://keras.io/)\n- [itertools](https://docs.python.org/2/library/itertools.html)\n\nYou will also need to have software installed to run and execute a [Jupyter Notebook](http://ipython.org/notebook.html)\n\nIf you do not have Python installed yet, it is highly recommended that you install the [Anaconda](http://continuum.io/downloads) distribution of Python, which already has the above packages and more included. Make sure that you select the Python 2.7 installer and not the Python 3.x installer. \n\n### Code\n1. Run `fea_eng0.py` to get your first features as a pickle file `fea0.pk`. \n\nNote: if you are using python 3, you need to switch the code, `fea_eng0.py`, in the last line. So the pickle file would be able to read in `nn_model.ipynb`.\n \n2. the manin code is provided in the `nn_model.ipynb` notebook file. 
You will also need the included `util.py` and `feature_generater.py` Python files, as well as the `train.csv` and `test.csv` dataset files, which you have to download from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) into the `input` folder. While `nn_model.ipynb` runs, the default output is created in the `model` folder. If you are interested in `util.py` and `feature_generater.py`, please feel free to explore these Python files. \n\n\n### Run\n\nIn a terminal or command window, navigate to the top-level project directory `Jupyter_nnmodel/` (that contains this README) and run one of the following commands:\n\n```bash\nipython notebook nn_model.ipynb\n```  \nor\n```bash\njupyter notebook nn_model.ipynb\n```\n\nThis will open the Jupyter Notebook software and project file in your browser.\n\n## Data\n\nYou can download the data from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data).\nIn the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.\n\n\n"
  },
  {
    "path": "Jupyter_nnmodel/fea_eng0.py",
    "content": "\"\"\"xgb prediction as features\"\"\"\nimport xgboost as xgb\nfrom sklearn.model_selection import KFold\nimport numpy as np\nimport pandas as pd\n\neta = 0.1\nmax_depth = 6\nsubsample = 0.9\ncolsample_bytree = 0.85\nmin_child_weight = 55\nnum_boost_round = 500\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\n\nparams = {\"objective\": \"reg:linear\",\n          \"booster\": \"gbtree\",\n          \"eta\": eta,\n          \"max_depth\": int(max_depth),\n          \"subsample\": subsample,\n          \"colsample_bytree\": colsample_bytree,\n          \"min_child_weight\": min_child_weight,\n          \"silent\": 1\n          }\n\ndata = train.append(test)\ndata.reset_index(inplace=True)\ntrain_rows = train.shape[0]\n\nfeature_results = []\n\nfor target_g in ['car', 'ind', 'reg']:\n    features = [x for x in list(data) if target_g not in x]\n    target_list = [x for x in list(data) if target_g in x]\n    train_fea = np.array(data[features])\n    for target in target_list:\n        print(target)\n        train_label = data[target]\n        kfold = KFold(n_splits=5, random_state=218, shuffle=True)\n        kf = kfold.split(data)\n        cv_train = np.zeros(shape=(data.shape[0], 1))\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = xgb.DMatrix(X_train, label_train)\n            dvalid = xgb.DMatrix(X_validate, label_validate)\n            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]\n            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,\n                            early_stopping_rounds=10)\n            cv_train[validate, 0] += 
bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)\n        feature_results.append(cv_train)\n\nfeature_results = np.hstack(feature_results)\ntrain_features = feature_results[:train_rows, :]\ntest_features = feature_results[train_rows:, :]\n\nimport pickle\n# Python 2: protocol 2 is the highest pickle protocol Python 2 supports\nwith open(\"../input/fea0.pk\", 'wb') as f:\n    pickle.dump([train_features, test_features], f, protocol=2)\n# Python 3: use protocol 3 instead\n# with open(\"../input/fea0.pk\", 'wb') as f:\n#     pickle.dump([train_features, test_features], f, protocol=3)\n\n"
  },
  {
    "path": "Jupyter_nnmodel/feature_generater.py",
    "content": "from util import proj_num_on_cat, Gini, interaction_features\nfrom itertools import combinations\nimport numpy as np\nimport pandas as pd\n\ndef Multiply_Divide(train, test, features):\n    \"\"\"\n    combinations:\n    combinations(['A', 'B','C'],2)  retrun AB AC BC\n    combinations(range(4), 3) --> 012 013 023 123\n    \"\"\"\n    feature_names= []\n    for e, (x, y) in enumerate(combinations(features, 2)):\n        train, test, feature_name= interaction_features(train, test, x, y, e)\n        for name in feature_name:\n            feature_names.append(name)\n\n\n    return train, test, feature_names\n\n\n\n\ndef Series_string(train, test, category_list):\n    '''\n    produce series as a string like new_ind as the following\n   id        new_ind_count                    new_ind\n595207            117       3_1_10_0_0_0_0_0_1_0_0_0_0_0_13_1_0_0\n595208            153       5_1_3_0_0_0_0_0_1_0_0_0_0_0_6_1_0_0\n\n    return train and test with new colunes of new_categories \n    '''\n    for category in category_list:\n\n        feature_names = list(train.columns)\n\n        features = [c for c in feature_names if category in c]\n        name= 'new_'+ category\n\n\n        count = 0\n        for c in features:\n            if count == 0:\n                train[name] = train[c].astype(str)\n                count += 1\n            else:\n                train[name] += '_' + train[c].astype(str)\n\n        count = 0\n        for c in features:\n            if count == 0:\n                test[name] = test[c].astype(str)\n                count += 1\n            else:\n                test[name] += '_' + test[c].astype(str)\n\n    return train, test\n\n\n\ndef Features_Counts(train, test, features):\n    feature_names =[]\n\n    for c in features:\n        d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n        train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n        test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n        
feature_names.append('%s_count'%c)\n\n    return train, test, feature_names\n\n\ndef Statistic_features(train, test, target_features, group_features):\n    train_list_=[]\n    test_list_=[]\n    for t in target_features:\n        for g in group_features:\n            if t != g:\n                s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n                train_list_.append(s_train)\n                test_list_.append(s_test)\n    return np.hstack(train_list_), np.hstack(test_list_)\n\ndef features_type(train):\n    data = []\n    for f in train.columns:\n        \n        # Defining the level\n        if 'bin' in f or f == 'target':\n            level = 'binary'\n        elif 'cat' in f or f == 'id':\n            level = 'nominal'\n        elif train[f].dtype == float:\n            level = 'interval'\n        elif train[f].dtype == int:\n            level = 'ordinal'\n            \n        # Initialize keep to True for all variables except for id\n        keep = True\n        if f == 'id':\n            keep = False\n        \n        # Defining the data type \n        dtype = train[f].dtype\n        \n        # Creating a Dict that contains all the metadata for the variable\n        f_dict = {\n            'varname': f,\n            'level': level,\n            'keep': keep,\n            'dtype': dtype\n        }\n        data.append(f_dict)\n    meta = pd.DataFrame(data, columns=['varname', 'level', 'keep', 'dtype'])\n    meta.set_index('varname', inplace=True)\n    interval = meta[(meta.level == 'interval') & (meta.keep)].index\n    ordinal = meta[(meta.level == 'ordinal') & (meta.keep)].index\n    binary = meta[(meta.level == 'binary') & (meta.keep)].index\n    nominal  = meta[(meta.level == 'nominal') & (meta.keep)].index\n    return interval, ordinal, binary, nominal\n\n\n\n\n"
  },
  {
    "path": "Jupyter_nnmodel/nn_model .ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/Users/stevenhu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\\n\",\n      \"  from ._conv import register_converters as _register_converters\\n\",\n      \"Using TensorFlow backend.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"#homemade script\\n\",\n    \"from util import Gini\\n\",\n    \"from feature_generater import Multiply_Divide, Series_string, Features_Counts, Statistic_features\\n\",\n    \"\\n\",\n    \"import numpy as np\\n\",\n    \"import pandas as pd\\n\",\n    \"import pickle\\n\",\n    \"from scipy import sparse\\n\",\n    \"from sklearn.preprocessing import StandardScaler\\n\",\n    \"from sklearn.preprocessing import LabelEncoder\\n\",\n    \"\\n\",\n    \"#for NN model\\n\",\n    \"from keras.layers import Dense, Dropout, Embedding, Flatten, Input, Concatenate, merge\\n\",\n    \"from keras.layers.normalization import BatchNormalization\\n\",\n    \"from keras.layers.advanced_activations import PReLU\\n\",\n    \"from keras.models import Model\\n\",\n    \"from time import time\\n\",\n    \"import datetime\\n\",\n    \"from sklearn.model_selection import StratifiedKFold\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# 1. 
Load Data #\\n\",\n    \"\\n\",\n    \"- Create train, test dataset\\n\",\n    \"- Create train target label\\n\",\n    \"- Create feature object: cat, num, bin, inter\\n\",\n    \"- Create feature column in train and test: count of missing values\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cv_only = True\\n\",\n    \"save_cv = True\\n\",\n    \"\\n\",\n    \"#read data\\n\",\n    \"train = pd.read_csv(\\\"../input/train.csv\\\")\\n\",\n    \"train_label = train['target']\\n\",\n    \"train_id = train['id']\\n\",\n    \"del train['target'], train['id']\\n\",\n    \"\\n\",\n    \"test = pd.read_csv(\\\"../input/test.csv\\\")\\n\",\n    \"test_id = test['id']\\n\",\n    \"del test['id']\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"#count the missing values (-1) in each row and record the count in column 'missing'\\n\",\n    \"train['missing'] = (train==-1).sum(axis=1).astype(float)\\n\",\n    \"test['missing'] = (test==-1).sum(axis=1).astype(float)\\n\",\n    \"\\n\",\n    \"#get all feature names\\n\",\n    \"feature_names = list(train)\\n\",\n    \"\\n\",\n    \"# extract features with cat, bin, num, inter\\n\",\n    \"cat_fea = [x for x in list(train) if 'cat' in x]\\n\",\n    \"bin_fea = [x for x in list(train) if 'bin' in x]\\n\",\n    \"num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\\n\",\n    \"inter_fea = [x for x in list(train) if 'inter' in x]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# 2. 
Feature Engineering #\\n\",\n    \"\\n\",\n    \"- Create Multiply and Divide features\\n\",\n    \"- Feature of Counts (count encoding)\\n\",\n    \"- Load features generated by the feature engineering script (`fea_eng0.py`)\\n\",\n    \"- Create Statistic features\\n\",\n    \"- Combine all features together, and get ready for training\\n\",\n    \"- Create Cat_feature for NN embedding training\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2.1 Multiply and Divide feature ##\\n\",\n    \"- multiply and divide each pair of features in the list and add the results as new columns to the train and test data sets\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#Add features of Multiply and Divide\\n\",\n    \"features= ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\\n\",\n    \"train, test, MD_features = Multiply_Divide(train, test, features)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>inter_0*</th>\\n\",\n       \"      <th>inter_0/</th>\\n\",\n       \"      <th>inter_1*</th>\\n\",\n       \"      <th>inter_1/</th>\\n\",\n       \"      <th>inter_2*</th>\\n\",\n       \"      
<th>inter_2/</th>\\n\",\n       \"      <th>inter_3*</th>\\n\",\n       \"      <th>inter_3/</th>\\n\",\n       \"      <th>inter_4*</th>\\n\",\n       \"      <th>inter_4/</th>\\n\",\n       \"      <th>...</th>\\n\",\n       \"      <th>inter_10*</th>\\n\",\n       \"      <th>inter_10/</th>\\n\",\n       \"      <th>inter_11*</th>\\n\",\n       \"      <th>inter_11/</th>\\n\",\n       \"      <th>inter_12*</th>\\n\",\n       \"      <th>inter_12/</th>\\n\",\n       \"      <th>inter_13*</th>\\n\",\n       \"      <th>inter_13/</th>\\n\",\n       \"      <th>inter_14*</th>\\n\",\n       \"      <th>inter_14/</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>4.418395</td>\\n\",\n       \"      <td>0.176736</td>\\n\",\n       \"      <td>0.634544</td>\\n\",\n       \"      <td>1.230630</td>\\n\",\n       \"      <td>9.720468</td>\\n\",\n       \"      <td>0.080334</td>\\n\",\n       \"      <td>0.618575</td>\\n\",\n       \"      <td>1.262398</td>\\n\",\n       \"      <td>1.767358</td>\\n\",\n       \"      <td>0.441839</td>\\n\",\n       \"      <td>...</td>\\n\",\n       \"      <td>0.502649</td>\\n\",\n       \"      <td>1.025815</td>\\n\",\n       \"      <td>1.436141</td>\\n\",\n       \"      <td>0.359035</td>\\n\",\n       \"      <td>7.7</td>\\n\",\n       \"      <td>15.714286</td>\\n\",\n       \"      <td>22</td>\\n\",\n       \"      <td>5.500000</td>\\n\",\n       \"      <td>1.4</td>\\n\",\n       \"      <td>0.350000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>4.331716</td>\\n\",\n       \"      <td>0.088402</td>\\n\",\n       \"      <td>0.474062</td>\\n\",\n       \"      <td>0.807773</td>\\n\",\n       \"      <td>1.856450</td>\\n\",\n       \"      <td>0.206272</td>\\n\",\n       \"      <td>0.495053</td>\\n\",\n       \"      <td>0.773521</td>\\n\",\n       
\"      <td>0.618817</td>\\n\",\n       \"      <td>0.618817</td>\\n\",\n       \"      <td>...</td>\\n\",\n       \"      <td>0.612862</td>\\n\",\n       \"      <td>0.957597</td>\\n\",\n       \"      <td>0.766078</td>\\n\",\n       \"      <td>0.766078</td>\\n\",\n       \"      <td>2.4</td>\\n\",\n       \"      <td>3.750000</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>3.000000</td>\\n\",\n       \"      <td>0.8</td>\\n\",\n       \"      <td>0.800000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>5.774271</td>\\n\",\n       \"      <td>0.071287</td>\\n\",\n       \"      <td>-0.641586</td>\\n\",\n       \"      <td>-0.641586</td>\\n\",\n       \"      <td>7.699029</td>\\n\",\n       \"      <td>0.053465</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>3.207929</td>\\n\",\n       \"      <td>0.128317</td>\\n\",\n       \"      <td>...</td>\\n\",\n       \"      <td>-0.000000</td>\\n\",\n       \"      <td>-inf</td>\\n\",\n       \"      <td>-5.000000</td>\\n\",\n       \"      <td>-0.200000</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>60</td>\\n\",\n       \"      <td>2.400000</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>1.085898</td>\\n\",\n       \"      <td>0.271474</td>\\n\",\n       \"      <td>0.315425</td>\\n\",\n       \"      <td>0.934592</td>\\n\",\n       \"      <td>4.343590</td>\\n\",\n       \"      <td>0.067869</td>\\n\",\n       \"      <td>0.488654</td>\\n\",\n       \"      <td>0.603276</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>...</td>\\n\",\n       \"      <td>0.522853</td>\\n\",\n       \"      <td>0.645497</td>\\n\",\n       \"      
<td>0.000000</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>7.2</td>\\n\",\n       \"      <td>8.888889</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>0.475728</td>\\n\",\n       \"      <td>0.673001</td>\\n\",\n       \"      <td>5.092484</td>\\n\",\n       \"      <td>0.062870</td>\\n\",\n       \"      <td>0.396082</td>\\n\",\n       \"      <td>0.808331</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>...</td>\\n\",\n       \"      <td>0.588531</td>\\n\",\n       \"      <td>1.201084</td>\\n\",\n       \"      <td>0.000000</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>6.3</td>\\n\",\n       \"      <td>12.857143</td>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>inf</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"<p>5 rows × 30 columns</p>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   inter_0*  inter_0/  inter_1*  inter_1/  inter_2*  inter_2/  inter_3*  \\\\\\n\",\n       \"0  4.418395  0.176736  0.634544  1.230630  9.720468  0.080334  0.618575   \\n\",\n       \"1  4.331716  0.088402  0.474062  0.807773  1.856450  0.206272  0.495053   \\n\",\n       \"2  5.774271  0.071287 -0.641586 -0.641586  7.699029  0.053465  0.000000   \\n\",\n       \"3  1.085898  0.271474  0.315425  0.934592  4.343590  0.067869  0.488654   \\n\",\n       \"4  0.000000       inf  0.475728  0.673001  5.092484  0.062870  0.396082   \\n\",\n       \"\\n\",\n       \"   inter_3/  inter_4*  inter_4/    ...      
inter_10*  inter_10/  inter_11*  \\\\\\n\",\n       \"0  1.262398  1.767358  0.441839    ...       0.502649   1.025815   1.436141   \\n\",\n       \"1  0.773521  0.618817  0.618817    ...       0.612862   0.957597   0.766078   \\n\",\n       \"2       inf  3.207929  0.128317    ...      -0.000000       -inf  -5.000000   \\n\",\n       \"3  0.603276  0.000000       inf    ...       0.522853   0.645497   0.000000   \\n\",\n       \"4  0.808331  0.000000       inf    ...       0.588531   1.201084   0.000000   \\n\",\n       \"\\n\",\n       \"   inter_11/  inter_12*  inter_12/  inter_13*  inter_13/  inter_14*  inter_14/  \\n\",\n       \"0   0.359035        7.7  15.714286         22   5.500000        1.4   0.350000  \\n\",\n       \"1   0.766078        2.4   3.750000          3   3.000000        0.8   0.800000  \\n\",\n       \"2  -0.200000        0.0        inf         60   2.400000        0.0   0.000000  \\n\",\n       \"3        inf        7.2   8.888889          0        inf        0.0        inf  \\n\",\n       \"4        inf        6.3  12.857143          0        inf        0.0        inf  \\n\",\n       \"\\n\",\n       \"[5 rows x 30 columns]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"display(train[MD_features].head(5))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2.2 Feature of Counts ##\\n\",\n    \"1. Generate new_ind, new_reg, new_car\\n\",\n    \"2. Count how often each value occurs (in train and test combined) for the\\n\",\n    \"    cat features, new_ind, new_reg and new_car\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"\\n\",\n    \"'''\\n\",\n    \"create 1_0_1_1..... 
data as new_xxx\\n\",\n    \"\\n\",\n    \"new_ind: join the values of all related \\\"ind\\\" columns into one underscore-separated string\\n\",\n    \"\\n\",\n    \"new_reg and new_car are built the same way, for both train and test data\\n\",\n    \"For RNN processing, generating a sequence number\\n\",\n    \"'''\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"category_list = ['ind', 'reg', 'car']\\n\",\n    \"#add 'new_ind','new_reg','new_car' to the train and test datasets\\n\",\n    \"train, test = Series_string(train,test,category_list )\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>new_ind</th>\\n\",\n       \"      <th>new_reg</th>\\n\",\n       \"      <th>new_car</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0</td>\\n\",\n       \"      <td>0.7_0.2_0.7180703308</td>\\n\",\n       \"      <td>10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0....</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1</td>\\n\",\n       \"      <td>0.8_0.4_0.7660776723</td>\\n\",\n       \"    
  <td>11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618...</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0</td>\\n\",\n       \"      <td>0.0_0.0_-1.0</td>\\n\",\n       \"      <td>7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415...</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0</td>\\n\",\n       \"      <td>0.9_0.2_0.5809475019</td>\\n\",\n       \"      <td>7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429...</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0</td>\\n\",\n       \"      <td>0.7_0.6_0.840758586</td>\\n\",\n       \"      <td>11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56...</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                                new_ind               new_reg  \\\\\\n\",\n       \"0  2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0  0.7_0.2_0.7180703308   \\n\",\n       \"1   1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1  0.8_0.4_0.7660776723   \\n\",\n       \"2  5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0          0.0_0.0_-1.0   \\n\",\n       \"3   0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0  0.9_0.2_0.5809475019   \\n\",\n       \"4   0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0   0.7_0.6_0.840758586   \\n\",\n       \"\\n\",\n       \"                                             new_car  \\n\",\n       \"0  10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0....  \\n\",\n       \"1  11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618...  \\n\",\n       \"2  7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415...  \\n\",\n       \"3  7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429...  \\n\",\n       \"4  11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56...  
\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"display(train[['new_ind','new_reg','new_car']].head(5))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"'''\\n\",\n    \"count_features\\n\",\n    \"\\n\",\n    \"preparing for train[cat_count_features] \\n\",\n    \"cat_fea = \\n\",\n    \"['ps_ind_02_cat','ps_ind_04_cat','ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat',\\n\",\n    \" 'ps_car_03_cat','ps_car_04_cat','ps_car_05_cat','ps_car_06_cat','ps_car_07_cat',\\n\",\n    \" 'ps_car_08_cat','ps_car_09_cat','ps_car_10_cat', 'ps_car_11_cat']\\n\",\n    \"\\n\",\n    \"Example: \\n\",\n    \"ps_ind_02_cat_count\\n\",\n    \"dictionay of ps_ind_02_cat \\n\",\n    \"([(1, 1079327), (2, 309747), (3, 70172), (4, 28259), (-1, 523)])\\n\",\n    \"\\n\",\n    \"row        count     origial value\\n\",\n    \"595202    1079327       1     \\n\",\n    \"595203     309747       2\\n\",\n    \"595204     309747       2\\n\",\n    \"595205      70172       3\\n\",\n    \"595206    1079327       1\\n\",\n    \"\\n\",\n    \"''' \\n\",\n    \"\\n\",\n    \"cat_fea = [ name for name in list(train) if 'cat' in name and 'count' not in name]\\n\",\n    \"features= cat_fea + ['new_ind','new_reg','new_car']\\n\",\n    \"\\n\",\n    \"train, test, cat_count_features= Features_Counts(train, test, features)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    
}\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>ps_ind_02_cat_count</th>\\n\",\n       \"      <th>ps_ind_04_cat_count</th>\\n\",\n       \"      <th>ps_ind_05_cat_count</th>\\n\",\n       \"      <th>ps_car_01_cat_count</th>\\n\",\n       \"      <th>ps_car_02_cat_count</th>\\n\",\n       \"      <th>ps_car_03_cat_count</th>\\n\",\n       \"      <th>ps_car_04_cat_count</th>\\n\",\n       \"      <th>ps_car_05_cat_count</th>\\n\",\n       \"      <th>ps_car_06_cat_count</th>\\n\",\n       \"      <th>ps_car_07_cat_count</th>\\n\",\n       \"      <th>ps_car_08_cat_count</th>\\n\",\n       \"      <th>ps_car_09_cat_count</th>\\n\",\n       \"      <th>ps_car_10_cat_count</th>\\n\",\n       \"      <th>ps_car_11_cat_count</th>\\n\",\n       \"      <th>new_ind_count</th>\\n\",\n       \"      <th>new_reg_count</th>\\n\",\n       \"      <th>new_car_count</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>309747</td>\\n\",\n       \"      <td>620936</td>\\n\",\n       \"      <td>1319412</td>\\n\",\n       \"      <td>124587</td>\\n\",\n       \"      <td>1234979</td>\\n\",\n       \"      <td>1028142</td>\\n\",\n       \"      <td>1241334</td>\\n\",\n       \"      <td>431560</td>\\n\",\n       \"      <td>77845</td>\\n\",\n       \"      <td>1383070</td>\\n\",\n       \"      <td>249663</td>\\n\",\n       \"      <td>486510</td>\\n\",\n       \"      <td>1475460</td>\\n\",\n       \"      <td>18326</td>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>24</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    
<tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>1079327</td>\\n\",\n       \"      <td>866864</td>\\n\",\n       \"      <td>1319412</td>\\n\",\n       \"      <td>518725</td>\\n\",\n       \"      <td>1234979</td>\\n\",\n       \"      <td>1028142</td>\\n\",\n       \"      <td>1241334</td>\\n\",\n       \"      <td>666910</td>\\n\",\n       \"      <td>329890</td>\\n\",\n       \"      <td>1383070</td>\\n\",\n       \"      <td>1238365</td>\\n\",\n       \"      <td>883326</td>\\n\",\n       \"      <td>1475460</td>\\n\",\n       \"      <td>12535</td>\\n\",\n       \"      <td>36</td>\\n\",\n       \"      <td>38</td>\\n\",\n       \"      <td>11</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>28259</td>\\n\",\n       \"      <td>620936</td>\\n\",\n       \"      <td>1319412</td>\\n\",\n       \"      <td>449617</td>\\n\",\n       \"      <td>1234979</td>\\n\",\n       \"      <td>1028142</td>\\n\",\n       \"      <td>1241334</td>\\n\",\n       \"      <td>666910</td>\\n\",\n       \"      <td>147714</td>\\n\",\n       \"      <td>1383070</td>\\n\",\n       \"      <td>1238365</td>\\n\",\n       \"      <td>883326</td>\\n\",\n       \"      <td>1475460</td>\\n\",\n       \"      <td>19943</td>\\n\",\n       \"      <td>24</td>\\n\",\n       \"      <td>13477</td>\\n\",\n       \"      <td>40</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>1079327</td>\\n\",\n       \"      <td>866864</td>\\n\",\n       \"      <td>1319412</td>\\n\",\n       \"      <td>449617</td>\\n\",\n       \"      <td>1234979</td>\\n\",\n       \"      <td>183044</td>\\n\",\n       \"      <td>1241334</td>\\n\",\n       \"      <td>431560</td>\\n\",\n       \"      <td>329890</td>\\n\",\n       \"      <td>1383070</td>\\n\",\n       \"      <td>1238365</td>\\n\",\n       \"      <td>36798</td>\\n\",\n       \"      <td>1475460</td>\\n\",\n    
   \"      <td>212989</td>\\n\",\n       \"      <td>2784</td>\\n\",\n       \"      <td>222</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>309747</td>\\n\",\n       \"      <td>620936</td>\\n\",\n       \"      <td>1319412</td>\\n\",\n       \"      <td>518725</td>\\n\",\n       \"      <td>1234979</td>\\n\",\n       \"      <td>1028142</td>\\n\",\n       \"      <td>1241334</td>\\n\",\n       \"      <td>666910</td>\\n\",\n       \"      <td>147714</td>\\n\",\n       \"      <td>1383070</td>\\n\",\n       \"      <td>1238365</td>\\n\",\n       \"      <td>883326</td>\\n\",\n       \"      <td>1475460</td>\\n\",\n       \"      <td>26161</td>\\n\",\n       \"      <td>258</td>\\n\",\n       \"      <td>34</td>\\n\",\n       \"      <td>13</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   ps_ind_02_cat_count  ps_ind_04_cat_count  ps_ind_05_cat_count  \\\\\\n\",\n       \"0               309747               620936              1319412   \\n\",\n       \"1              1079327               866864              1319412   \\n\",\n       \"2                28259               620936              1319412   \\n\",\n       \"3              1079327               866864              1319412   \\n\",\n       \"4               309747               620936              1319412   \\n\",\n       \"\\n\",\n       \"   ps_car_01_cat_count  ps_car_02_cat_count  ps_car_03_cat_count  \\\\\\n\",\n       \"0               124587              1234979              1028142   \\n\",\n       \"1               518725              1234979              1028142   \\n\",\n       \"2               449617              1234979              1028142   \\n\",\n       \"3               449617              1234979               183044   \\n\",\n       \"4               518725              1234979      
        1028142   \\n\",\n       \"\\n\",\n       \"   ps_car_04_cat_count  ps_car_05_cat_count  ps_car_06_cat_count  \\\\\\n\",\n       \"0              1241334               431560                77845   \\n\",\n       \"1              1241334               666910               329890   \\n\",\n       \"2              1241334               666910               147714   \\n\",\n       \"3              1241334               431560               329890   \\n\",\n       \"4              1241334               666910               147714   \\n\",\n       \"\\n\",\n       \"   ps_car_07_cat_count  ps_car_08_cat_count  ps_car_09_cat_count  \\\\\\n\",\n       \"0              1383070               249663               486510   \\n\",\n       \"1              1383070              1238365               883326   \\n\",\n       \"2              1383070              1238365               883326   \\n\",\n       \"3              1383070              1238365                36798   \\n\",\n       \"4              1383070              1238365               883326   \\n\",\n       \"\\n\",\n       \"   ps_car_10_cat_count  ps_car_11_cat_count  new_ind_count  new_reg_count  \\\\\\n\",\n       \"0              1475460                18326              6             24   \\n\",\n       \"1              1475460                12535             36             38   \\n\",\n       \"2              1475460                19943             24          13477   \\n\",\n       \"3              1475460               212989           2784            222   \\n\",\n       \"4              1475460                26161            258             34   \\n\",\n       \"\\n\",\n       \"   new_car_count  \\n\",\n       \"0              1  \\n\",\n       \"1             11  \\n\",\n       \"2             40  \\n\",\n       \"3              1  \\n\",\n       \"4             13  \"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"# 
display(train[['new_ind','new_reg','new_car']].head(5))\\n\",\n    \"display(train[cat_count_features].head(5))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2.3 Get the feature from feature training ## \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"train_fea0, test_fea0 = pickle.load(open(\\\"../input/fea0.pk\\\",'rb'), encoding='iso-8859-1')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2.4 Statistic features ##\\n\",\n    \"\\n\",\n    \"- find the feature of median, mean and standard deviation\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#feature aggregation\\n\",\n    \"target_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\\n\",\n    \"group_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']\\n\",\n    \"\\n\",\n    \"#return numpy because we need to do np.hstack to merge all statistic feature together, so that it would return np array\\n\",\n    \"train_statis, test_statis =  Statistic_features(train, test, target_features, group_features)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"array([[  1.57043000e+05,   8.32786207e-01,   2.41530046e-01, ...,\\n\",\n       \"          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\\n\",\n       \"       [  1.30452000e+05,   8.26528390e-01,   2.35133348e-01, ...,\\n\",\n       \"          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\\n\",\n       \"       [  6.35510000e+04,   8.13168936e-01,   2.35946815e-01, ...,\\n\",\n       \"          1.00000000e+00,   7.00000000e+00,   
0.00000000e+00],\\n\",\n       \"       ..., \\n\",\n       \"       [  3.58630000e+04,   8.00633360e-01,   2.34463222e-01, ...,\\n\",\n       \"          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\\n\",\n       \"       [  2.04836000e+05,   8.24270444e-01,   2.28649975e-01, ...,\\n\",\n       \"          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\\n\",\n       \"       [  9.85280000e+04,   8.24453229e-01,   2.37806003e-01, ...,\\n\",\n       \"          1.00000000e+00,   7.00000000e+00,   0.00000000e+00]])\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"display(train_statis)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2.5 Combine all feature together, and get ready for training ##\\n\",\n    \"\\n\",\n    \"1. merge features into train_list & test_list that would like to dump into NN model\\n\",\n    \"2. training a scaler by sparse that generated by train_list & test_list\\n\",\n    \"4. 
convert train_list & test_list into the scaled arrays X and X_test\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"'''\\n\",\n    \"Build a train list from train_num, cat_count_features, the statistic features and the features inferred by XGBoost.\\n\",\n    \"\\n\",\n    \"train_num: training data without the cat and calc columns\\n\",\n    \"cat_count_features: cat_fea + ['new_ind','new_reg','new_car']\\n\",\n    \"train_fea0: features loaded from fea0.pk\\n\",\n    \"'''\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"#training data without the cat and calc columns\\n\",\n    \"train_num = train[[x for x in list(train) if x in num_features]]\\n\",\n    \"test_num = test[[x for x in list(train) if x in num_features]]\\n\",\n    \"\\n\",\n    \"train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_statis, train_fea0 ]\\n\",\n    \"test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_statis,test_fea0] \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"'''\\n\",\n    \"X is stacked from 4 feature groups\\n\",\n    \"1. train_num(595212,54): training data without the cat and calc columns\\n\",\n    \"2. cat_count_features(595212,17): cat_fea + ['new_ind','new_reg','new_car']\\n\",\n    \"3. feature statis(595212,6) * 36\\n\",\n    \"4. 
train_fea0(595212, 38): feature extraction\\n\",\n    \"\\n\",\n    \"all_data (595212, 235)\\n\",\n    \"'''\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"X = sparse.hstack(train_list).tocsr()\\n\",\n    \"X_test = sparse.hstack(test_list).tocsr()\\n\",\n    \"\\n\",\n    \"all_data = np.vstack([X.toarray(), X_test.toarray()])\\n\",\n    \"scaler = StandardScaler()\\n\",\n    \"scaler.fit(all_data)\\n\",\n    \"X = scaler.transform(X.toarray())\\n\",\n    \"X_test = scaler.transform(X_test.toarray())\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 2.6 Create categorical features for NN embedding training ##\\n\",\n    \" The original author does not explain why this step is done, only what it does.\\n\",\n    \" \\n\",\n    \" - The NN model below takes its input, including the final testing data, as a list of the form **[[cat_feature], X]**\\n\",\n    \" or **[['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'], X]**\\n\",\n    \" \\n\",\n    \" **_The next cell builds the testing data in this form. 
If this is unclear, that is fine; just continue to the next steps_**\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#preparing for training cat \\n\",\n    \"# take explicit copies so the label encoding below does not write into a slice of train/test\\n\",\n    \"train_cat = train[cat_fea].copy()\\n\",\n    \"test_cat = test[cat_fea].copy()\\n\",\n    \"\\n\",\n    \"# storing the dimension for embedding layer as an input value\\n\",\n    \"max_cat_values = []\\n\",\n    \"\\n\",\n    \"for c in cat_fea:\\n\",\n    \"    \\n\",\n    \"    #normalize the labels to 0..n_classes-1 (fit on train and test together)\\n\",\n    \"    #LabelEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html\\n\",\n    \"    \\n\",\n    \"    le = LabelEncoder()\\n\",\n    \"    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\\n\",\n    \"    train_cat[c] = le.transform(train_cat[c])\\n\",\n    \"    test_cat[c] = le.transform(test_cat[c])\\n\",\n    \"    max_cat_values.append(np.max(x))\\n\",\n    \"\\n\",\n    \"# convert pd to np.array only after encoding, so the arrays hold the encoded labels\\n\",\n    \"X_cat = train_cat.values\\n\",\n    \"tem = test_cat.values\\n\",\n    \"\\n\",\n    \"# Build the final testing data\\n\",\n    \"X_TEST_CAT = []\\n\",\n    \"for i in range(tem.shape[1]):\\n\",\n    \"    X_TEST_CAT.append(tem[:, i].reshape(-1, 1))\\n\",\n    \"X_TEST_CAT.append(X_test)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 57,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"cat_fea: ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat']\\n\",\n      \"\\n\",\n      \"max_cat_values:  [4, 2, 7, 12, 2, 2, 9, 2, 17, 2, 1, 5, 2, 103]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('cat_fea:', cat_fea)\\n\",\n    \"print('\\\\nmax_cat_values: ',max_cat_values)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# 3. Training NN Model with Keras # \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"1. Build the model\\n\",\n    \"2. training\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3.1 Build the model ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### model structure:  ###\\n\",\n    \"<img src=\\\"Jupyter_image/NN_layer.png\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 47,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def nn_model():\\n\",\n    \"    inputs = []\\n\",\n    \"    flatten_layers = []\\n\",\n    \"    for e, c in enumerate(cat_fea):\\n\",\n    \"        input_c = Input(shape=(1, ), dtype='int32')\\n\",\n    \"        num_c = max_cat_values[e]\\n\",\n    \"        \\n\",\n    \"        # need to add 1, https://keras.io/layers/embeddings/\\n\",\n    \"        # **input_dim: int > 0. Size of the vocabulary, i.e. 
maximum integer index + 1.**\\n\",\n    \"        embed_c = Embedding(num_c+1,6,input_length=1)(input_c)\\n\",\n    \"        embed_c = Dropout(0.25)(embed_c)\\n\",\n    \"        flatten_c = Flatten()(embed_c)\\n\",\n    \"        inputs.append(input_c)\\n\",\n    \"        flatten_layers.append(flatten_c)\\n\",\n    \"        \\n\",\n    \"    \\n\",\n    \"    input_num = Input(shape=(X.shape[1],), dtype='float32')\\n\",\n    \"    inputs.append(input_num)\\n\",\n    \"    \\n\",\n    \"    #merge X and embedding layer\\n\",\n    \"    flatten_layers.append(input_num)\\n\",\n    \"    flatten = merge(flatten_layers, mode='concat')\\n\",\n    \"\\n\",\n    \"    fc1 = Dense(512, kernel_initializer='he_normal')(flatten)\\n\",\n    \"    fc1 = PReLU()(fc1)\\n\",\n    \"    fc1 = BatchNormalization()(fc1)\\n\",\n    \"    fc1 = Dropout(0.75)(fc1)\\n\",\n    \"\\n\",\n    \"    fc1 = Dense(64, kernel_initializer='he_normal')(fc1)\\n\",\n    \"    fc1 = PReLU()(fc1)\\n\",\n    \"    fc1 = BatchNormalization()(fc1)\\n\",\n    \"    fc1 = Dropout(0.5)(fc1)\\n\",\n    \"\\n\",\n    \"    outputs = Dense(1, kernel_initializer='he_normal', activation='sigmoid')(fc1)\\n\",\n    \"\\n\",\n    \"    model = Model(inputs = inputs, outputs = outputs)\\n\",\n    \"    model.compile(loss='binary_crossentropy', optimizer='adam')\\n\",\n    \"    return (model)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## 3.2 Start to Train ##\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 51,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/ipykernel_launcher.py:22: UserWarning: The `merge` function is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. 
`add`, `concatenate`, etc.\\n\",\n      \"/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/keras/legacy/layers.py:465: UserWarning: The `Merge` layer is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\\n\",\n      \"  name=name)\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Train on 297606 samples, validate on 297606 samples\\n\",\n      \"Epoch 1/1\\n\",\n      \" - 27s - loss: 0.3061 - val_loss: 0.1639\\n\",\n      \"local fold Gini:  0.209322322663\\n\",\n      \"Train on 297606 samples, validate on 297606 samples\\n\",\n      \"Epoch 1/1\\n\",\n      \" - 28s - loss: 0.3087 - val_loss: 0.1645\\n\",\n      \"local fold Gini:  0.201585256464\\n\",\n      \"seed 0: Gini 0.20545379936910999\\n\",\n      \"Total training time:  0:01:55.265348\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"\\\"\\\"\\\"\\n\",\n    \"#validation fold\\n\",\n    \"NFOLDS = 5\\n\",\n    \"kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\\n\",\n    \"\\n\",\n    \"I change \\\"test\\\" to \\\"vaild\\\" because I feel it is clear to understand\\n\",\n    \"\\\"\\\"\\\"\\n\",\n    \"\\n\",\n    \"cv_train = np.zeros(len(train_label))\\n\",\n    \"cv_pred = np.zeros(len(test_id))\\n\",\n    \"\\n\",\n    \"#validation fold\\n\",\n    \"NFOLDS = 5\\n\",\n    \"kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\\n\",\n    \"\\n\",\n    \"#with different random see make result stable.\\n\",\n    \"num_seeds = 5\\n\",\n    \"begintime = time()\\n\",\n    \"if cv_only:\\n\",\n    \"    for s in range(num_seeds):\\n\",\n    \"        np.random.seed(s)\\n\",\n    \"        for (train_index, valid_index) in kfold.split(X, train_label):\\n\",\n    \"            \\n\",\n    \"            #assign data from training data and labels to validation data; \\n\",\n    \"            x_train = 
X[train_index]\\n\",\n    \"            y_train = train_label[train_index]\\n\",\n    \"            x_valid= X[valid_index]\\n\",\n    \"            y_valid = train_label[valid_index]\\n\",\n    \"            \\n\",\n    \"            # assign X_cat to validation data; \\n\",\n    \"            x_train_cat = X_cat[train_index]\\n\",\n    \"            x_valid_cat = X_cat[valid_index]\\n\",\n    \"\\n\",\n    \"            #Package data for training, the package(list) is  [[cat_featrues], x_train] \\n\",\n    \"            # or [ ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'] ,x_train]\\n\",\n    \"            \\n\",\n    \"            x_train_cat_list, x_valid_cat_list = [], []\\n\",\n    \"            for i in range(x_train_cat.shape[1]):\\n\",\n    \"                x_train_cat_list.append(x_train_cat[:, i].reshape(-1, 1))\\n\",\n    \"                x_valid_cat_list.append(x_valid_cat[:, i].reshape(-1, 1))\\n\",\n    \"\\n\",\n    \"            x_train_cat_list.append(x_train)\\n\",\n    \"            x_valid_cat_list.append(x_valid)\\n\",\n    \"            \\n\",\n    \"            #load model\\n\",\n    \"            model = nn_model()\\n\",\n    \"            \\n\",\n    \"            def get_rank(x):\\n\",\n    \"                return pd.Series(x).rank(pct=True).values\\n\",\n    \"            #fit model. 
Note: tune epochs to trade training time against prediction accuracy\\n\",\n    \"            model.fit(x_train_cat_list, y_train, epochs=10, batch_size=512, verbose=2, validation_data=(x_valid_cat_list, y_valid))\\n\",\n    \"            \\n\",\n    \"            #record the rank-transformed prediction on the validation data\\n\",\n    \"            cv_train[valid_index] += get_rank(model.predict(x=x_valid_cat_list, batch_size=512, verbose=0)[:, 0])\\n\",\n    \"            print('local fold Gini: ',Gini(train_label[valid_index], cv_train[valid_index]))\\n\",\n    \"            \\n\",\n    \"            #record the rank-transformed prediction on the testing data\\n\",\n    \"            cv_pred += get_rank(model.predict(x=X_TEST_CAT, batch_size=512, verbose=0)[:, 0])\\n\",\n    \"             \\n\",\n    \"            \\n\",\n    \"        \\n\",\n    \"        print(\\\"seed {0}: Gini {1}\\\".format(s,Gini(train_label, cv_train / (1. * (s + 1)))))\\n\",\n    \"        print(\\\"Total training time: \\\",str(datetime.timedelta(seconds=time() - begintime)))\\n\",\n    \"    if save_cv:\\n\",\n    \"        \\n\",\n    \"        #divide by (NFOLDS * num_seeds) to average the test predictions over all folds and seeds\\n\",\n    \"        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)\\n\",\n    \"        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)\\n\",\n    \"\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.4\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "Jupyter_nnmodel/util.py",
    "content": "import numpy as np\nimport pandas as pd\n\ndef Gini(y_true, y_pred):\n    # check and get number of samples\n    assert y_true.shape == y_pred.shape\n    n_samples = y_true.shape[0]\n\n    # sort rows on prediction column\n    # (from largest to smallest)\n    arr = np.array([y_true, y_pred]).transpose()\n    true_order = arr[arr[:, 0].argsort()][::-1, 0]\n    pred_order = arr[arr[:, 1].argsort()][::-1, 0]\n\n    # get Lorenz curves\n    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)\n    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)\n    L_ones = np.linspace(1 / n_samples, 1, n_samples)\n\n    # get Gini coefficients (area between curves)\n    G_true = np.sum(L_ones - L_true)\n    G_pred = np.sum(L_ones - L_pred)\n\n    # normalize to true Gini coefficient\n    return G_pred * 1. / G_true\n\n\ndef cat_count(train_df, test_df, cat_list):\n    train_df['row_id'] = range(train_df.shape[0])\n    test_df['row_id'] = range(test_df.shape[0])\n    train_df['train'] = 1\n    test_df['train'] = 0\n    all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])\n    for e, cat in enumerate(cat_list):\n        grouped = all_df[[cat]].groupby(cat)\n        the_size = pd.DataFrame(grouped.size()).reset_index()\n        the_size.columns = [cat, '{}_size'.format(cat)]\n        all_df = pd.merge(all_df, the_size, how='left')\n\n        selected_train = all_df[all_df['train'] == 1]\n        selected_test = all_df[all_df['train'] == 0]\n        selected_train.sort_values('row_id', inplace=True)\n        selected_test.sort_values('row_id', inplace=True)\n        selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)\n        selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)\n\n        selected_train, selected_test = np.array(selected_train), np.array(selected_test)\n    print(selected_train.shape, selected_test.shape)\n    return selected_train, selected_test\n\n\ndef 
proj_num_on_cat(train_df, test_df, target_column, group_column):\n    \"\"\"\n    :param train_df: train data frame\n    :param test_df:  test data frame\n    :param target_column: name of numerical feature\n    :param group_column: name of categorical feature\n    \"\"\"\n    train_df['row_id'] = range(train_df.shape[0])  # 595211 create index for each row\n    \n    test_df['row_id'] = range(test_df.shape[0])\n    train_df['train'] = 1\n    test_df['train'] = 0\n    \n    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',\n                                                                                        target_column, group_column]]).copy()\n    \n    \n    #count the number\n    grouped = all_df[[target_column, group_column]].groupby(group_column)\n    \n    \n    #count the number of distint value  from the list [1,1, 2,3]   \n    #[1,2,3]  so answer is  3 \n    \n    #count the number of each distint value  [1,1,2,3]  \n    #1:2 \n    #2:1 \n    #3:1\n    \n    \n    #count the number of each distint value\n    the_size = pd.DataFrame(grouped.size()).reset_index()\n    the_size.columns = [group_column, '%s_size' % target_column]  #rename columns name\n\n    #find the mean, std, median, max, min of each distint value\n    the_mean = pd.DataFrame(grouped.mean()).reset_index()\n    the_mean.columns = [group_column, '%s_mean' % target_column] #rename columns name\n    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)\n    the_std.columns = [group_column, '%s_std' % target_column]\n    the_median = pd.DataFrame(grouped.median()).reset_index()\n    the_median.columns = [group_column, '%s_median' % target_column]\n    the_max = pd.DataFrame(grouped.max()).reset_index()\n    the_max.columns = [group_column, '%s_max' % target_column]\n    the_min = pd.DataFrame(grouped.min()).reset_index()\n    the_min.columns = [group_column, '%s_min' % target_column]\n    \n    #merge them \n    
# every frame has the same RangeIndex after reset_index(), so a plain\n    # column-wise concat aligns the rows (join_axes was removed in pandas 1.0)\n    the_stats = pd.concat([the_size, the_mean.iloc[:, 1], the_std.iloc[:, 1],\n                           the_median.iloc[:, 1], the_max.iloc[:, 1], the_min.iloc[:, 1]],\n                          axis=1)\n\n    #insert the statistics into the original data\n    all_df = pd.merge(all_df, the_stats, how='left')\n\n    #split into train and test\n    selected_train = all_df[all_df['train'] == 1].copy()\n    selected_test = all_df[all_df['train'] == 0].copy()\n    selected_train.sort_values('row_id', inplace=True)\n    selected_test.sort_values('row_id', inplace=True)\n    \n    #remove target_column, group_column, 'row_id', 'train' columns\n    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)\n    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)\n\n    selected_train, selected_test = np.array(selected_train), np.array(selected_test)\n    return selected_train, selected_test\n\n\ndef interaction_features(train, test, fea1, fea2, prefix):\n    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]\n    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]\n\n    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]\n    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]\n    feature_name = ['inter_{}*'.format(prefix), 'inter_{}/'.format(prefix), ]\n\n    return train, test, feature_name\n"
  },
  {
    "path": "code/fea_eng0.py",
    "content": "\"\"\"xgb prediction as features\"\"\"\nimport xgboost as xgb\nfrom sklearn.model_selection import KFold\nimport numpy as np\nimport pandas as pd\n\neta = 0.1\nmax_depth = 6\nsubsample = 0.9\ncolsample_bytree = 0.85\nmin_child_weight = 55\nnum_boost_round = 500\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\n\nparams = {\"objective\": \"reg:linear\",\n          \"booster\": \"gbtree\",\n          \"eta\": eta,\n          \"max_depth\": int(max_depth),\n          \"subsample\": subsample,\n          \"colsample_bytree\": colsample_bytree,\n          \"min_child_weight\": min_child_weight,\n          \"silent\": 1\n          }\n\ndata = train.append(test)\ndata.reset_index(inplace=True)\ntrain_rows = train.shape[0]\n\nfeature_results = []\n\nfor target_g in ['car', 'ind', 'reg']:\n    features = [x for x in list(data) if target_g not in x]\n    target_list = [x for x in list(data) if target_g in x]\n    train_fea = np.array(data[features])\n    for target in target_list:\n        print(target)\n        train_label = data[target]\n        kfold = KFold(n_splits=5, random_state=218, shuffle=True)\n        kf = kfold.split(data)\n        cv_train = np.zeros(shape=(data.shape[0], 1))\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = xgb.DMatrix(X_train, label_train)\n            dvalid = xgb.DMatrix(X_validate, label_validate)\n            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]\n            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,\n                            early_stopping_rounds=10)\n            cv_train[validate, 0] += 
bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)\n        feature_results.append(cv_train)\n\nfeature_results = np.hstack(feature_results)\ntrain_features = feature_results[:train_rows, :]\ntest_features = feature_results[train_rows:, :]\n\nimport pickle\npickle.dump([train_features, test_features], open(\"../input/fea0.pk\", 'wb'))\n"
  },
  {
    "path": "code/gbm_model291.py",
    "content": "import lightgbm as lgbm\nfrom scipy import sparse as ssp\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import LabelEncoder\nfrom sklearn.preprocessing import OneHotEncoder\n\ndef Gini(y_true, y_pred):\n    # check and get number of samples\n    assert y_true.shape == y_pred.shape\n    n_samples = y_true.shape[0]\n\n    # sort rows on prediction column\n    # (from largest to smallest)\n    arr = np.array([y_true, y_pred]).transpose()\n    true_order = arr[arr[:, 0].argsort()][::-1, 0]\n    pred_order = arr[arr[:, 1].argsort()][::-1, 0]\n\n    # get Lorenz curves\n    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)\n    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)\n    L_ones = np.linspace(1 / n_samples, 1, n_samples)\n\n    # get Gini coefficients (area between curves)\n    G_true = np.sum(L_ones - L_true)\n    G_pred = np.sum(L_ones - L_pred)\n\n    # normalize to true Gini coefficient\n    return G_pred * 1. 
/ G_true\n\ncv_only = True\nsave_cv = True\nfull_train = False\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds), True\n\npath = \"../input/\"\n\ntrain = pd.read_csv(path+'train.csv')\ntrain_label = train['target']\ntrain_id = train['id']\ntest = pd.read_csv(path+'test.csv')\ntest_id = test['id']\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ny = train['target'].values\ndrop_feature = [\n    'id',\n    'target'\n]\n\nX = train.drop(drop_feature,axis=1)\nfeature_names = X.columns.tolist()\ncat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]\nnum_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\nnum_features.append('missing')\n\nfor c in cat_features:\n    le = LabelEncoder()\n    le.fit(train[c])\n    train[c] = le.transform(train[c])\n    test[c] = le.transform(test[c])\n\nenc = OneHotEncoder()\nenc.fit(train[cat_features])\nX_cat = enc.transform(train[cat_features])\nX_t_cat = enc.transform(test[cat_features])\n\nind_features = [c for c in feature_names if 'ind' in c]\ncount=0\nfor c in ind_features:\n    if count==0:\n        train['new_ind'] = train[c].astype(str)+'_'\n        test['new_ind'] = test[c].astype(str)+'_'\n        count+=1\n    else:\n        train['new_ind'] += train[c].astype(str)+'_'\n        test['new_ind'] += test[c].astype(str)+'_'\n\ncat_count_features = []\nfor c in cat_features+['new_ind']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\ntrain_list = [train[num_features+cat_count_features].values,X_cat,]\ntest_list = [test[num_features+cat_count_features].values,X_t_cat,]\n\nX = 
ssp.hstack(train_list).tocsr()\nX_test = ssp.hstack(test_list).tocsr()\n\nlearning_rate = 0.1\nnum_leaves = 15\nmin_data_in_leaf = 2000\nfeature_fraction = 0.6\nnum_boost_round = 10000\nparams = {\"objective\": \"binary\",\n          \"boosting_type\": \"gbdt\",\n          \"learning_rate\": learning_rate,\n          \"num_leaves\": num_leaves,\n           \"max_bin\": 256,\n          \"feature_fraction\": feature_fraction,\n          \"verbosity\": 0,\n          \"drop_rate\": 0.1,\n          \"is_unbalance\": False,\n          \"max_drop\": 50,\n          \"min_child_samples\": 10,\n          \"min_child_weight\": 150,\n          \"min_split_gain\": 0,\n          \"subsample\": 0.9\n          }\n\nx_score = []\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfor s in xrange(16):\n    cv_train = np.zeros(len(train_label))\n    cv_pred = np.zeros(len(test_id))\n\n    params['seed'] = s\n\n    if cv_only:\n        kf = kfold.split(X, train_label)\n\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = lgbm.Dataset(X_train, label_train)\n            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)\n            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=100)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)\n            cv_train[validate] += bst.predict(X_validate)\n\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            fold_scores.append(score)\n\n        cv_pred /= NFOLDS\n        final_cv_train += cv_train\n        final_cv_pred += 
cv_pred\n\n        print(\"cv score:\")\n        print Gini(train_label, cv_train)\n        print \"current score:\", Gini(train_label, final_cv_train / (s + 1.)), s+1\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n\n        x_score.append(Gini(train_label, cv_train))\n\nprint(x_score)\npd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm3_pred_avg.csv', index=False)\npd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm3_cv_avg.csv', index=False)\n\n"
  },
  {
    "path": "code/nn_model290.py",
    "content": "from keras.layers import Dense, Dropout, Embedding, Flatten, Input, merge\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.layers.advanced_activations import PReLU\nfrom time import time\nimport datetime\nfrom keras.models import Model\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini, interaction_features\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom scipy import sparse\nfrom sklearn.preprocessing import StandardScaler\nimport pickle\nfrom sklearn.preprocessing import LabelEncoder\n\ncv_only = True\nsave_cv = True\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x for x in list(train) if 'bin' in x]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\n\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\nnum_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\nnum_features.append('missing')\ninter_fea = [x for x in list(train) if 'inter' in x]\n\nfeature_names = list(train)\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        train['new_ind'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_ind'] += '_' + train[c].astype(str)\n\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        test['new_ind'] = 
test[c].astype(str)\n        count += 1\n    else:\n        test['new_ind'] += '_' + test[c].astype(str)\n\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        train['new_reg'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_reg'] += '_' + train[c].astype(str)\n\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        test['new_reg'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_reg'] += '_' + test[c].astype(str)\n\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        train['new_car'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_car'] += '_' + train[c].astype(str)\n\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        test['new_car'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_car'] += '_' + test[c].astype(str)\n\ntrain_cat = train[cat_fea]\ntrain_num = train[[x for x in list(train) if x in num_features]]\ntest_cat = test[cat_fea]\ntest_num = test[[x for x in list(train) if x in num_features]]\n\nmax_cat_values = []\nfor c in cat_fea:\n    le = LabelEncoder()\n    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\n    train_cat[c] = le.transform(train_cat[c])\n    test_cat[c] = le.transform(test_cat[c])\n    max_cat_values.append(np.max(x))\n\n# xgboost prediction\ntrain_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\"))\n\ncat_count_features = []\nfor c in cat_fea + ['new_ind','new_reg','new_car']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\n\nprint(train_num.dtypes)\ntrain_list = 
[train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]\ntest_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0]\n\n#feature aggregation\nfor t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:\n    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n\nall_data = np.vstack([X.toarray(), X_test.toarray()])\nscaler = StandardScaler()\nscaler.fit(all_data)\nX = scaler.transform(X.toarray())\nX_test = scaler.transform(X_test.toarray())\nprint(X.shape, X_test.shape)\n\n\ncv_train = np.zeros(len(train_label))\ncv_pred = np.zeros(len(test_id))\n\nX_cat = train_cat.as_matrix()\nX_test_cat = test_cat.as_matrix()\n\nx_test_cat = []\nfor i in xrange(X_test_cat.shape[1]):\n    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))\nx_test_cat.append(X_test)\n\ndef nn_model():\n    inputs = []\n    flatten_layers = []\n    for e, c in enumerate(cat_fea):\n        input_c = Input(shape=(1, ), dtype='int32')\n        # labels run from 0 to max, so the embedding needs max + 1 rows\n        num_c = max_cat_values[e] + 1\n        embed_c = Embedding(\n            num_c,\n            6,\n            input_length=1\n        )(input_c)\n        embed_c = Dropout(0.25)(embed_c)\n        flatten_c = Flatten()(embed_c)\n\n        inputs.append(input_c)\n        flatten_layers.append(flatten_c)\n\n    input_num = Input(shape=(X.shape[1],), dtype='float32')\n    flatten_layers.append(input_num)\n    inputs.append(input_num)\n\n    flatten = merge(flatten_layers, mode='concat')\n\n    fc1 = Dense(512, init='he_normal')(flatten)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.75)(fc1)\n\n    fc1 = Dense(64, 
init='he_normal')(fc1)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.5)(fc1)\n\n    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)\n\n    model = Model(input = inputs, output = outputs)\n    model.compile(loss='binary_crossentropy', optimizer='adam')\n    return (model)\n\nnum_seeds = 5\nbegintime = time()\nif cv_only:\n    for s in xrange(num_seeds):\n        np.random.seed(s)\n        for (inTr, inTe) in kfold.split(X, train_label):\n            xtr = X[inTr]\n            ytr = train_label[inTr]\n            xte = X[inTe]\n            yte = train_label[inTe]\n\n            xtr_cat = X_cat[inTr]\n            xte_cat = X_cat[inTe]\n\n            # get xtr xte cat\n            xtr_cat_list, xte_cat_list = [], []\n            for i in xrange(xtr_cat.shape[1]):\n                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))\n                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))\n\n            xtr_cat_list.append(xtr)\n            xte_cat_list.append(xte)\n\n            model = nn_model()\n            def get_rank(x):\n                return pd.Series(x).rank(pct=True).values\n            model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])\n            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])\n            print(Gini(train_label[inTe], cv_train[inTe]))\n            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])\n        print(s)\n        print(Gini(train_label, cv_train / (1. * (s + 1))))\n        print(str(datetime.timedelta(seconds=time() - begintime)))\n    if save_cv:\n        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)\n        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)\n\n"
  },
  {
    "path": "code/simple_average.py",
    "content": "'''\nsimple average of two models to get 2nd place\n'''\nimport pandas as pd\nkeras5_test = pd.read_csv(\"../model/keras5_pred.csv\")\nlgbm3_test = pd.read_csv(\"../model/lgbm3_pred_avg.csv\")\n\ndef get_rank(x):\n    return pd.Series(x).rank(pct=True).values\n\npd.DataFrame({'id': keras5_test['id'], 'target':\n    get_rank(keras5_test['target']) * 0.5 + get_rank(keras5_test['target']) * 0.5}).to_csv(\n    \"../model/simple_average.csv\", index = False)"
  },
  {
    "path": "code/util.py",
    "content": "import numpy as np\nimport pandas as pd\n\ndef Gini(y_true, y_pred):\n    # check and get number of samples\n    assert y_true.shape == y_pred.shape\n    n_samples = y_true.shape[0]\n\n    # sort rows on prediction column\n    # (from largest to smallest)\n    arr = np.array([y_true, y_pred]).transpose()\n    true_order = arr[arr[:, 0].argsort()][::-1, 0]\n    pred_order = arr[arr[:, 1].argsort()][::-1, 0]\n\n    # get Lorenz curves\n    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)\n    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)\n    L_ones = np.linspace(1 / n_samples, 1, n_samples)\n\n    # get Gini coefficients (area between curves)\n    G_true = np.sum(L_ones - L_true)\n    G_pred = np.sum(L_ones - L_pred)\n\n    # normalize to true Gini coefficient\n    return G_pred * 1. / G_true\n\n\ndef cat_count(train_df, test_df, cat_list):\n    train_df['row_id'] = range(train_df.shape[0])\n    test_df['row_id'] = range(test_df.shape[0])\n    train_df['train'] = 1\n    test_df['train'] = 0\n    all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])\n    for e, cat in enumerate(cat_list):\n        grouped = all_df[[cat]].groupby(cat)\n        the_size = pd.DataFrame(grouped.size()).reset_index()\n        the_size.columns = [cat, '{}_size'.format(cat)]\n        all_df = pd.merge(all_df, the_size, how='left')\n\n        selected_train = all_df[all_df['train'] == 1]\n        selected_test = all_df[all_df['train'] == 0]\n        selected_train.sort_values('row_id', inplace=True)\n        selected_test.sort_values('row_id', inplace=True)\n        selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)\n        selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)\n\n        selected_train, selected_test = np.array(selected_train), np.array(selected_test)\n    print(selected_train.shape, selected_test.shape)\n    return selected_train, selected_test\n\n\ndef 
proj_num_on_cat(train_df, test_df, target_column, group_column):\n    \"\"\"\n    :param train_df: train data frame\n    :param test_df:  test data frame\n    :param target_column: name of numerical feature\n    :param group_column: name of categorical feature\n    \"\"\"\n    train_df['row_id'] = range(train_df.shape[0])\n    test_df['row_id'] = range(test_df.shape[0])\n    train_df['train'] = 1\n    test_df['train'] = 0\n    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',\n                                                                                        target_column, group_column]])\n    grouped = all_df[[target_column, group_column]].groupby(group_column)\n    the_size = pd.DataFrame(grouped.size()).reset_index()\n    the_size.columns = [group_column, '%s_size' % target_column]\n    the_mean = pd.DataFrame(grouped.mean()).reset_index()\n    the_mean.columns = [group_column, '%s_mean' % target_column]\n    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)\n    the_std.columns = [group_column, '%s_std' % target_column]\n    the_median = pd.DataFrame(grouped.median()).reset_index()\n    the_median.columns = [group_column, '%s_median' % target_column]\n    the_stats = pd.merge(the_size, the_mean)\n    the_stats = pd.merge(the_stats, the_std)\n    the_stats = pd.merge(the_stats, the_median)\n\n    the_max = pd.DataFrame(grouped.max()).reset_index()\n    the_max.columns = [group_column, '%s_max' % target_column]\n    the_min = pd.DataFrame(grouped.min()).reset_index()\n    the_min.columns = [group_column, '%s_min' % target_column]\n\n    the_stats = pd.merge(the_stats, the_max)\n    the_stats = pd.merge(the_stats, the_min)\n\n    all_df = pd.merge(all_df, the_stats, how='left')\n\n    selected_train = all_df[all_df['train'] == 1]\n    selected_test = all_df[all_df['train'] == 0]\n    selected_train.sort_values('row_id', inplace=True)\n    selected_test.sort_values('row_id', inplace=True)\n    
selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)\n    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)\n\n    selected_train, selected_test = np.array(selected_train), np.array(selected_test)\n    print(selected_train.shape, selected_test.shape)\n    return selected_train, selected_test\n\n\ndef interaction_features(train, test, fea1, fea2, prefix):\n    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]\n    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]\n\n    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]\n    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]\n\n    return train, test\n"
  },
  {
    "path": "code_for_exact_solution/keras3.py",
    "content": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as xgb\nfrom sklearn import model_selection, preprocessing, ensemble\nfrom keras.models import Sequential\nfrom keras.layers import Dense, Dropout, Activation\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.layers.advanced_activations import PReLU\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.model_selection import StratifiedKFold\nfrom time import time\nimport datetime\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import StandardScaler\nfrom itertools import combinations\nimport pickle\n\n'''\nsimple xgboost benchmark\n'''\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.feature_selection import SelectPercentile, f_classif\nimport numpy as np\nimport pandas as pd\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom sklearn.linear_model import LogisticRegression as LR\nfrom sklearn.preprocessing import StandardScaler\nfrom scipy.sparse import csr_matrix\nimport pickle\ncv_only = True\nsave_cv = True\nfull_train = True\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds)\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x for x in list(train) if 'bin' in x]\n\ntrain['missing'] = 
(train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\n\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\nnum_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\nnum_features.append('missing')\ninter_fea = [x for x in list(train) if 'inter' in x]\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nohe = OneHotEncoder(sparse=True)\n\ntrain_cat = train[cat_fea].as_matrix()\ntrain_num = train[[x for x in list(train) if x in num_features]]\ntest_cat = test[cat_fea].as_matrix()\ntest_num = test[[x for x in list(train) if x in num_features]]\ntrain_cat[train_cat < 0] = 99\ntest_cat[test_cat < 0] = 99\n\ntraintest = np.vstack((train_cat, test_cat))\ntraintest = pd.DataFrame(traintest, columns=cat_fea)\nprint(traintest.shape)\n#encoder = ce.HelmertEncoder(cols=cat_fea)\n#encoder.fit(traintest)\n#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))\n#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))\nohe.fit(traintest)\ntrain_ohe = ohe.transform(train_cat)\ntest_ohe = ohe.transform(test_cat)\ndel traintest\n\ntrain_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\"))\n\ncat_count_features = []\nfor c in cat_fea:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\ntrain_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, 
test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]\n\n#proj\nfor t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:\n    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n#X = train_num\n#X_test = test_num\nall_data = np.vstack([X.toarray(), X_test.toarray()])\n#all_data = np.vstack([X, X_test])\n\nscaler = StandardScaler()\nscaler.fit(all_data)\nX = scaler.transform(X.toarray())\nX_test = scaler.transform(X_test.toarray())\n#X = scaler.transform(X)\n#X_test = scaler.transform(X_test)\n\nprint(X.shape, X_test.shape)\n\n\ncv_train = np.zeros(len(train_label))\ncv_pred = np.zeros(len(test_id))\n\ndef nn_model():\n    model = Sequential()\n\n    model.add(Dense(512, input_dim=X.shape[1], init='he_normal'))\n    model.add(PReLU())\n    model.add(BatchNormalization())\n    model.add(Dropout(0.9))\n\n    model.add(Dense(64, init='he_normal'))\n    model.add(PReLU())\n    model.add(BatchNormalization())\n    model.add(Dropout(0.8))\n\n    model.add(Dense(1, init='he_normal', activation='sigmoid'))\n    model.compile(loss='binary_crossentropy', optimizer='adam')\n    return (model)\n\nnum_seeds = 5\nbegintime = time()\nif cv_only:\n    for s in xrange(num_seeds):\n        np.random.seed(s)\n        for (inTr, inTe) in kfold.split(X, train_label):\n            xtr = X[inTr]\n            ytr = train_label[inTr]\n            xte = X[inTe]\n            yte = train_label[inTe]\n\n            model = nn_model()\n            model.fit(xtr, ytr, epochs=35, batch_size=512, verbose=2, validation_data=[xte, yte])\n            cv_train[inTe] += model.predict_proba(x=xte, batch_size=512, verbose=0)[:, 
0]\n            cv_pred += model.predict_proba(x=X_test, batch_size=512, verbose=0)[:, 0]\n        print(s)\n        print(Gini(train_label, cv_train / (1. * (s + 1))))\n        print(str(datetime.timedelta(seconds=time() - begintime)))\n    if save_cv:\n        pd.DataFrame({'id': test_id, 'target': cv_pred * 1./ (NFOLDS * num_seeds)}).to_csv('../model/keras3_pred.csv', index=False)\n        pd.DataFrame({'id': train_id, 'target': cv_train * 1. / num_seeds}).to_csv('../model/keras3_cv.csv', index=False)\n\n"
  },
  {
    "path": "code_for_exact_solution/keras6.py",
    "content": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as xgb\nfrom sklearn import model_selection, preprocessing, ensemble\nfrom keras.models import Sequential\nfrom keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.layers.advanced_activations import PReLU\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.model_selection import StratifiedKFold\nfrom time import time\nimport datetime\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import StandardScaler\nfrom itertools import combinations\nimport pickle\nfrom keras.models import Model\n'''\nsimple xgboost benchmark\n'''\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.feature_selection import SelectPercentile, f_classif\nimport numpy as np\nimport pandas as pd\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom sklearn.linear_model import LogisticRegression as LR\nfrom sklearn.preprocessing import StandardScaler\nfrom scipy.sparse import csr_matrix\nimport pickle\ncv_only = True\nsave_cv = True\nfull_train = True\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds)\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x 
for x in list(train) if 'bin' in x]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\n\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\nnum_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\nnum_features.append('missing')\ninter_fea = [x for x in list(train) if 'inter' in x]\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\npath = \"../input/\"\nnum_features_comb = []\nfor p in os.listdir(path):\n    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:\n        print(p)\n        x,xt = pd.read_pickle(path+p)\n        train[p] = x\n        test[p] = xt\n        num_features_comb.append(p)\n\nnum_features += num_features_comb\n\nfeature_names = list(train)\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        train['new_ind'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_ind'] += '_' + train[c].astype(str)\n\nprint(train['new_ind'].nunique())\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        test['new_ind'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_ind'] += '_' + test[c].astype(str)\n\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        train['new_reg'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_reg'] += '_' + train[c].astype(str)\n\nprint(train['new_reg'].nunique())\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        test['new_reg'] = test[c].astype(str)\n        count += 1\n    
else:\n        test['new_reg'] += '_' + test[c].astype(str)\n\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        train['new_car'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_car'] += '_' + train[c].astype(str)\n\nprint(train['new_car'].nunique())\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        test['new_car'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_car'] += '_' + test[c].astype(str)\n\nnew_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')\ntrain['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]\ntest['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]\nprint(train['ps_reg_03'].head(10))\n\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nfrom sklearn.preprocessing import LabelEncoder\n\ntrain_cat = train[cat_fea]\ntrain_num = train[[x for x in list(train) if x in num_features]]\ntest_cat = test[cat_fea]\ntest_num = test[[x for x in list(train) if x in num_features]]\n\nmax_cat_values = []\nfor c in cat_fea:\n    le = LabelEncoder()\n    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\n    train_cat[c] = le.transform(train_cat[c])\n    test_cat[c] = le.transform(test_cat[c])\n    max_cat_values.append(np.max(x))\n\n\n#train_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\"))\n\ncat_count_features = []\nfor c in cat_fea + ['new_ind','new_reg','new_car']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\n\nprint(train_num.dtypes)\ntrain_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), 
test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]\n\n#proj\nfor t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:\n    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n#X = train_num\n#X_test = test_num\nall_data = np.vstack([X.toarray(), X_test.toarray()])\n#all_data = np.vstack([X, X_test])\n\nscaler = StandardScaler()\nscaler.fit(all_data)\nX = scaler.transform(X.toarray())\nX_test = scaler.transform(X_test.toarray())\n#X = scaler.transform(X)\n#X_test = scaler.transform(X_test)\n\nprint(X.shape, X_test.shape)\n\n\ncv_train = np.zeros(len(train_label))\ncv_pred = np.zeros(len(test_id))\n\n# NOTE: this variant is overridden by the second nn_model() defined further down\ndef nn_model():\n    inputs = []\n    flatten_layers = []\n    for e, c in enumerate(cat_fea):\n        input_c = Input(shape=(1, ), dtype='int32')\n        # labels run from 0 to max, so the embedding needs max + 1 rows\n        num_c = max_cat_values[e] + 1\n        embed_c = Embedding(\n            num_c,\n            64,\n            input_length=1\n        )(input_c)\n        embed_c = Dropout(0.25)(embed_c)\n        flatten_c = Flatten()(embed_c)\n\n        inputs.append(input_c)\n        flatten_layers.append(flatten_c)\n\n    input_num = Input(shape=(X.shape[1],), dtype='float32')\n    flatten_layers.append(input_num)\n    inputs.append(input_num)\n\n    flatten = merge(flatten_layers, mode='concat')\n\n    fc1 = Dense(512, kernel_initializer='he_normal')(flatten)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.8)(fc1)\n\n    fc1 = Dense(64, kernel_initializer='he_normal')(fc1)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.8)(fc1)\n\n    outputs = Dense(1, kernel_initializer='he_normal', activation='sigmoid')(fc1)\n\n    
model = Model(input = inputs, output = outputs)\n    model.compile(loss='binary_crossentropy', optimizer='adam')\n    return (model)\n\n\nX_cat = train_cat.as_matrix()\nX_test_cat = test_cat.as_matrix()\n\nx_test_cat = []\nfor i in xrange(X_test_cat.shape[1]):\n    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))\nx_test_cat.append(X_test)\nnum_seeds = 5\n\ndef nn_model():\n    inputs = []\n    flatten_layers = []\n    for e, c in enumerate(cat_fea):\n        input_c = Input(shape=(1, ), dtype='int32')\n        # labels run from 0 to max, so the embedding needs max + 1 rows\n        num_c = max_cat_values[e] + 1\n        embed_c = Embedding(\n            num_c,\n            6,\n            input_length=1\n        )(input_c)\n        embed_c = Dropout(0.25)(embed_c)\n        flatten_c = Flatten()(embed_c)\n\n        inputs.append(input_c)\n        flatten_layers.append(flatten_c)\n\n    input_num = Input(shape=(X.shape[1],), dtype='float32')\n    flatten_layers.append(input_num)\n    inputs.append(input_num)\n\n    flatten = merge(flatten_layers, mode='concat')\n\n    fc1 = Dense(512, init='he_normal')(flatten)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.75)(fc1)\n\n    fc1 = Dense(64, init='he_normal')(fc1)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.5)(fc1)\n\n    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)\n\n    model = Model(input = inputs, output = outputs)\n    model.compile(loss='binary_crossentropy', optimizer='adam')\n    return (model)\n\nnum_seeds = 5\nbegintime = time()\nif cv_only:\n    for s in xrange(num_seeds):\n        np.random.seed(s)\n        for (inTr, inTe) in kfold.split(X, train_label):\n            xtr = X[inTr]\n            ytr = train_label[inTr]\n            xte = X[inTe]\n            yte = train_label[inTe]\n\n            xtr_cat = X_cat[inTr]\n            xte_cat = X_cat[inTe]\n\n            # get xtr xte cat\n            xtr_cat_list, xte_cat_list = [], []\n            for i in xrange(xtr_cat.shape[1]):\n          
      xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))\n                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))\n\n            xtr_cat_list.append(xtr)\n            xte_cat_list.append(xte)\n\n            model = nn_model()\n            def get_rank(x):\n                return pd.Series(x).rank(pct=True).values\n            model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])\n            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])\n            print(Gini(train_label[inTe], cv_train[inTe]))\n            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])\n        print(s)\n        print(Gini(train_label, cv_train / (1. * (s + 1))))\n        print(str(datetime.timedelta(seconds=time() - begintime)))\n    if save_cv:\n        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras6_pred.csv', index=False)\n        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras6_cv.csv', index=False)\n\n"
  },
  {
    "path": "code_for_exact_solution/keras7.py",
    "content": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as xgb\nfrom sklearn import model_selection, preprocessing, ensemble\nfrom keras.models import Sequential\nfrom keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.layers.advanced_activations import PReLU\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.model_selection import StratifiedKFold\nfrom time import time\nimport datetime\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import StandardScaler\nfrom itertools import combinations\nimport pickle\nfrom keras.models import Model\n'''\nsimple xgboost benchmark\n'''\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.feature_selection import SelectPercentile, f_classif\nimport numpy as np\nimport pandas as pd\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom sklearn.linear_model import LogisticRegression as LR\nfrom sklearn.preprocessing import StandardScaler\nfrom scipy.sparse import csr_matrix\nimport pickle\ncv_only = True\nsave_cv = True\nfull_train = True\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds)\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x 
for x in list(train) if 'bin' in x]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\n\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\nnum_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\nnum_features.append('missing')\ninter_fea = [x for x in list(train) if 'inter' in x]\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\npath = \"../input/\"\nnum_features_comb = []\nfor p in os.listdir(path):\n    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:\n        print(p)\n        x,xt = pd.read_pickle(path+p)\n        train[p] = x\n        test[p] = xt\n        num_features_comb.append(p)\n\nnum_features += num_features_comb\n\nfeature_names = list(train)\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        train['new_ind'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_ind'] += '_' + train[c].astype(str)\n\nprint(train['new_ind'].nunique())\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        test['new_ind'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_ind'] += '_' + test[c].astype(str)\n\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        train['new_reg'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_reg'] += '_' + train[c].astype(str)\n\nprint(train['new_reg'].nunique())\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        test['new_reg'] = test[c].astype(str)\n        count += 1\n    
else:\n        test['new_reg'] += '_' + test[c].astype(str)\n\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        train['new_car'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_car'] += '_' + train[c].astype(str)\n\nprint(train['new_car'].nunique())\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        test['new_car'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_car'] += '_' + test[c].astype(str)\n\nnew_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')\ntrain['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]\ntest['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]\nprint(train['ps_reg_03'].head(10))\n\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nfrom sklearn.preprocessing import LabelEncoder\n\ntrain_cat = train[cat_fea]\ntrain_num = train[[x for x in list(train) if x in num_features]]\ntest_cat = test[cat_fea]\ntest_num = test[[x for x in list(train) if x in num_features]]\n\nmax_cat_values = []\nfor c in cat_fea:\n    le = LabelEncoder()\n    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\n    train_cat[c] = le.transform(train_cat[c])\n    test_cat[c] = le.transform(test_cat[c])\n    max_cat_values.append(np.max(x))\n\n\ntrain_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\"))\n\ncat_count_features = []\nfor c in cat_fea + ['new_ind','new_reg','new_car']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\n\nprint(train_num.dtypes)\ntrain_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), 
test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]\n\n#proj\nfor t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:\n    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n#X = train_num\n#X_test = test_num\nall_data = np.vstack([X.toarray(), X_test.toarray()])\n#all_data = np.vstack([X, X_test])\n\nscaler = StandardScaler()\nscaler.fit(all_data)\nX = scaler.transform(X.toarray())\nX_test = scaler.transform(X_test.toarray())\n#X = scaler.transform(X)\n#X_test = scaler.transform(X_test)\n\nprint(X.shape, X_test.shape)\n\n\ncv_train = np.zeros(len(train_label))\ncv_pred = np.zeros(len(test_id))\n\nX_cat = train_cat.as_matrix()\nX_test_cat = test_cat.as_matrix()\n\nx_test_cat = []\nfor i in xrange(X_test_cat.shape[1]):\n    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))\nx_test_cat.append(X_test)\n\nX_cat = train_cat.as_matrix()\nX_test_cat = test_cat.as_matrix()\n\nx_test_cat = []\nfor i in xrange(X_test_cat.shape[1]):\n    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))\nx_test_cat.append(X_test)\nnum_seeds = 5\n\ndef nn_model():\n    inputs = []\n    flatten_layers = []\n    for e, c in enumerate(cat_fea):\n        input_c = Input(shape=(1, ), dtype='int32')\n        num_c = max_cat_values[e]\n        embed_c = Embedding(\n            num_c,\n            3,\n            input_length=1\n        )(input_c)\n        embed_c = Dropout(0.)(embed_c)\n        flatten_c = Flatten()(embed_c)\n\n        inputs.append(input_c)\n        flatten_layers.append(flatten_c)\n\n    input_num = Input(shape=(X.shape[1],), dtype='float32')\n    flatten_layers.append(input_num)\n    
inputs.append(input_num)\n\n    flatten = merge(flatten_layers, mode='concat')\n\n    fc1 = Dense(512, init='he_normal')(flatten)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.8)(fc1)\n\n    fc1 = Dense(64, init='he_normal')(fc1)\n    fc1 = PReLU()(fc1)\n    fc1 = BatchNormalization()(fc1)\n    fc1 = Dropout(0.8)(fc1)\n\n    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)\n\n    model = Model(input = inputs, output = outputs)\n    model.compile(loss='binary_crossentropy', optimizer='adam')\n    return (model)\n\nnum_seeds = 3\nbegintime = time()\nif cv_only:\n    for s in xrange(num_seeds):\n        np.random.seed(s)\n        for (inTr, inTe) in kfold.split(X, train_label):\n            xtr = X[inTr]\n            ytr = train_label[inTr]\n            xte = X[inTe]\n            yte = train_label[inTe]\n\n            xtr_cat = X_cat[inTr]\n            xte_cat = X_cat[inTe]\n\n            # get xtr xte cat\n            xtr_cat_list, xte_cat_list = [], []\n            for i in xrange(xtr_cat.shape[1]):\n                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))\n                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))\n\n            xtr_cat_list.append(xtr)\n            xte_cat_list.append(xte)\n\n            model = nn_model()\n            def get_rank(x):\n                return pd.Series(x).rank(pct=True).values\n            model.fit(xtr_cat_list, ytr, epochs=30, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])\n            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])\n            print(Gini(train_label[inTe], cv_train[inTe]))\n            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])\n        print(s)\n        print(Gini(train_label, cv_train / (1. 
* (s + 1))))\n        print(str(datetime.timedelta(seconds=time() - begintime)))\n        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred_0.csv', index=False)\n        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv_0.csv', index=False)\n\n    if save_cv:\n        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred.csv', index=False)\n        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv.csv', index=False)\n"
  },
  {
    "path": "code_for_exact_solution/lightgbm1.py",
    "content": "'''\nsimple xgboost benchmark\n'''\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat, cat_count\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\n\ncv_only = True\nsave_cv = True\nfull_train = False\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds), True\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\nlearning_rate = 0.1\nnum_leaves = 15\nmin_data_in_leaf = 2000\n# max_bin = x\nfeature_fraction = 0.6\nnum_boost_round = 10000\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\ncat_list = [x for x in list(train) if 'cat' in x]\nprint(cat_list)\n\ntrain_copy = train.copy()\ntest_copy = test.copy()\ntrain_copy = train_copy.replace(-1, np.NaN)\ntest_copy = test_copy.replace(-1, np.NaN)\ntrain['num_na'] = train_copy.isnull().sum(axis=1)\ntest['num_na'] = test_copy.isnull().sum(axis=1)\ndel train_copy, test_copy\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nohe = OneHotEncoder(sparse=False)\ntrain_cat = train[[x for x in list(train) if 'cat' in x]].as_matrix()\ntrain_num = train[[x for x in list(train) if 'cat' not in x]]\ntest_cat = test[[x for x in list(test) if 'cat' in x]].as_matrix()\ntest_num = test[[x for x in list(test) if 'cat' not in x]]\ntrain_cat[train_cat < 0] = 99\ntest_cat[test_cat < 0] = 99\n\ntrain_ohe = ohe.fit_transform(train_cat)\ntest_ohe = ohe.transform(test_cat)\n\n\nprint(\"cat_list now:\", cat_list)\ntrain_cat_count, test_cat_count = cat_count(train, test, cat_list)\nprint(\"cat count shape:\", 
train_cat_count.shape, test_cat_count.shape)\n\nX = sparse.hstack([train_num, train_ohe, train_cat_count]).tocsr()\nX_test = sparse.hstack([test_num, test_ohe, test_cat_count]).tocsr()\nprint(X.shape, X_test.shape)\n\nparams = {\"objective\": \"binary\",\n          \"boosting_type\": \"gbdt\",\n          \"learning_rate\": learning_rate,\n          \"num_leaves\": int(num_leaves),\n           \"max_bin\": 256,\n          \"min_data_in_leaf\": min_data_in_leaf,\n          \"feature_fraction\": feature_fraction,\n          \"verbosity\": 0,\n          \"seed\": 218,\n          \"drop_rate\": 0.1,\n          \"is_unbalance\": False,\n          \"max_drop\": 50,\n          \"min_child_samples\": 10,\n          \"min_child_weight\": 150,\n          \"min_split_gain\": 0,\n          \"subsample\": 0.9\n          }\n\nx_score = []\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfor s in xrange(16):\n    cv_train = np.zeros(len(train_label))\n    cv_pred = np.zeros(len(test_id))\n\n    params['seed'] = s\n\n    if cv_only:\n        kf = kfold.split(X, train_label)\n\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = lgbm.Dataset(X_train, label_train)\n            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)\n            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=100)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)\n            cv_train[validate] += bst.predict(X_validate)\n\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            
fold_scores.append(score)\n\n        cv_pred /= NFOLDS\n        final_cv_train += cv_train\n        final_cv_pred += cv_pred\n\n        print(\"cv score:\")\n        print(Gini(train_label, cv_train))\n        print(\"current score: %s %s\" % (Gini(train_label, final_cv_train / (s + 1.)), s + 1))\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n\n        x_score.append(Gini(train_label, cv_train))\n\nprint(x_score)\npd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm1_pred_avg.csv', index=False)\npd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm1_cv_avg.csv', index=False)"
  },
  {
    "path": "code_for_exact_solution/lightgbm5.py",
    "content": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing import normalize\nfrom scipy.stats import spearmanr\nimport lightgbm as lgb\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat, cat_count\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\n\ncv_only = True\nsave_cv = True\nfull_train = False\n\nimport numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nimport xgboost as xgb\nfrom sklearn.model_selection import StratifiedKFold, KFold\nimport numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\n\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds), True\n\npath = \"../input/\"\n\ntrain = pd.read_csv(path+'train.csv')\ntrain_label = train['target']\ntrain_id = train['id']\ntest = pd.read_csv(path+'test.csv')\ntest_id = test['id']\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\nlearning_rate = 0.1\nnum_leaves = 15\nmin_data_in_leaf = 2000\n# max_bin = x\nfeature_fraction = 0.6\nnum_boost_round = 10000\n\ny = 
train['target'].values\ndrop_feature = [\n    'id',\n    'target'\n]\n\nX = train.drop(drop_feature,axis=1)\nfeature_names = X.columns.tolist()\ncat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]\nnum_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\nnum_features.append('missing')\n\nfrom sklearn.preprocessing import OneHotEncoder,LabelEncoder\nfor c in cat_features:\n    le = LabelEncoder()\n    le.fit(train[c])\n    train[c] = le.transform(train[c])\n    test[c] = le.transform(test[c])\n\nfrom sklearn.preprocessing import normalize\nfrom scipy.stats import spearmanr\nfrom sklearn.preprocessing import OneHotEncoder\nenc = OneHotEncoder()\nenc.fit(train[cat_features])\nX_cat = enc.transform(train[cat_features])\nX_t_cat = enc.transform(test[cat_features])\n\nind_features = [c for c in feature_names if 'ind' in c]\ncount=0\nfor c in ind_features:\n    if count==0:\n        train['new_ind'] = train[c].astype(str)+'_'\n        test['new_ind'] = test[c].astype(str)+'_'\n        count+=1\n    else:\n        train['new_ind'] += train[c].astype(str)+'_'\n        test['new_ind'] += test[c].astype(str)+'_'\n\ncat_count_features = []\nfor c in cat_features+['new_ind']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n    score = spearmanr(train['target'],train['%s_count'%c])\n    print(c,score)\n\ntrain_list = [train[num_features+cat_count_features].values,X_cat,]\ntest_list = [test[num_features+cat_count_features].values,X_t_cat,]\n\n# missing binary projections\n#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]\n#for miss_fea in missing_list:\n#    train['{}_miss_code'.format(miss_fea)] = 
(train[miss_fea] == -1).astype(int)\n#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)\n\nX = ssp.hstack(train_list).tocsr()\nX_test = ssp.hstack(test_list).tocsr()\n\n\nparams = {\"objective\": \"poisson\",\n          \"boosting_type\": \"gbdt\",\n          \"learning_rate\": learning_rate,\n          \"num_leaves\": int(num_leaves),\n           \"max_bin\": 256,\n          \"feature_fraction\": feature_fraction,\n          \"verbosity\": 0,\n          \"drop_rate\": 0.1,\n          \"is_unbalance\": False,\n          \"max_drop\": 50,\n          \"min_child_samples\": 10,\n          \"min_child_weight\": 150,\n          \"min_split_gain\": 0,\n          \"subsample\": 0.9\n          }\n\nx_score = []\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfor s in xrange(10):\n    cv_train = np.zeros(len(train_label))\n    cv_pred = np.zeros(len(test_id))\n\n    params['seed'] = s\n\n    if cv_only:\n        kf = kfold.split(X, train_label)\n\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = lgbm.Dataset(X_train, label_train)\n            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)\n            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=100)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)\n            cv_train[validate] += bst.predict(X_validate)\n\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            fold_scores.append(score)\n\n        cv_pred /= NFOLDS\n        final_cv_train += cv_train\n        
final_cv_pred += cv_pred\n\n        print(\"cv score:\")\n        print(Gini(train_label, cv_train))\n        print(\"current score: %s %s\" % (Gini(train_label, final_cv_train / (s + 1.)), s + 1))\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n\n        x_score.append(Gini(train_label, cv_train))\n\nprint(x_score)\n\npred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})\npred_result['target'] = pred_result['target'].rank(pct=True)\npred_result.to_csv('../model/lgbm5_pred_avg.csv', index=False)\n\ncv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})\ncv_result['target'] = cv_result['target'].rank(pct=True)\ncv_result.to_csv('../model/lgbm5_cv_avg.csv', index=False)\n\n#cv score:\n#0.287007087138\n#current score: 0.289683837899 16\n"
  },
  {
    "path": "code_for_exact_solution/lightgbm6.py",
    "content": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing import normalize\nfrom scipy.stats import spearmanr\nimport lightgbm as lgb\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat, cat_count\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\n\ncv_only = True\nsave_cv = True\nfull_train = False\n\nimport numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nimport xgboost as xgb\nfrom sklearn.model_selection import StratifiedKFold, KFold\nimport numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\n\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds), True\n\npath = \"../input/\"\n\ntrain = pd.read_csv(path+'train.csv')\ntrain_label = train['target']\ntrain_id = train['id']\ntest = pd.read_csv(path+'test.csv')\ntest_id = test['id']\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\nlearning_rate = 0.1\nnum_leaves = 15\nmin_data_in_leaf = 2000\n# max_bin = x\nfeature_fraction = 0.6\nnum_boost_round = 10000\n\ny = 
train['target'].values\ndrop_feature = [\n    'id',\n    'target'\n]\n\nX = train.drop(drop_feature,axis=1)\nfeature_names = X.columns.tolist()\ncat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]\nnum_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\nnum_features.append('missing')\n\nfrom sklearn.preprocessing import OneHotEncoder,LabelEncoder\nfor c in cat_features:\n    le = LabelEncoder()\n    le.fit(train[c])\n    train[c] = le.transform(train[c])\n    test[c] = le.transform(test[c])\n\nfrom sklearn.preprocessing import normalize\nfrom scipy.stats import spearmanr\nfrom sklearn.preprocessing import OneHotEncoder\nenc = OneHotEncoder()\nenc.fit(train[cat_features])\nX_cat = enc.transform(train[cat_features])\nX_t_cat = enc.transform(test[cat_features])\n\nind_features = [c for c in feature_names if 'ind' in c]\ncount=0\nfor c in ind_features:\n    if count==0:\n        train['new_ind'] = train[c].astype(str)+'_'\n        test['new_ind'] = test[c].astype(str)+'_'\n        count+=1\n    else:\n        train['new_ind'] += train[c].astype(str)+'_'\n        test['new_ind'] += test[c].astype(str)+'_'\n\ncat_count_features = []\nfor c in cat_features+['new_ind']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n    score = spearmanr(train['target'],train['%s_count'%c])\n    print(c,score)\n\ntrain_list = [train[num_features+cat_count_features].values,X_cat,]\ntest_list = [test[num_features+cat_count_features].values,X_t_cat,]\n\n# missing binary projections\n#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]\n#for miss_fea in missing_list:\n#    train['{}_miss_code'.format(miss_fea)] = 
(train[miss_fea] == -1).astype(int)\n#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)\n\nX = ssp.hstack(train_list).tocsr()\nX_test = ssp.hstack(test_list).tocsr()\n\n\nparams = {\"objective\": \"fair\",\n          \"boosting_type\": \"gbdt\",\n          \"learning_rate\": learning_rate,\n          \"num_leaves\": int(num_leaves),\n           \"max_bin\": 256,\n          \"feature_fraction\": feature_fraction,\n          \"verbosity\": 0,\n          \"drop_rate\": 0.1,\n          \"is_unbalance\": False,\n          \"max_drop\": 50,\n          \"min_child_samples\": 10,\n          \"min_child_weight\": 150,\n          \"min_split_gain\": 0,\n          \"subsample\": 0.9\n          }\n\nx_score = []\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfor s in xrange(10):\n    cv_train = np.zeros(len(train_label))\n    cv_pred = np.zeros(len(test_id))\n\n    params['seed'] = s\n\n    if cv_only:\n        kf = kfold.split(X, train_label)\n\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = lgbm.Dataset(X_train, label_train)\n            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)\n            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=100)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)\n            cv_train[validate] += bst.predict(X_validate)\n\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            fold_scores.append(score)\n\n        cv_pred /= NFOLDS\n        final_cv_train += cv_train\n        
final_cv_pred += cv_pred\n\n        print(\"cv score:\")\n        print(Gini(train_label, cv_train))\n        print(\"current score: %s %s\" % (Gini(train_label, final_cv_train / (s + 1.)), s + 1))\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n\n        x_score.append(Gini(train_label, cv_train))\n\nprint(x_score)\n\npred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})\npred_result['target'] = pred_result['target'].rank(pct=True)\npred_result.to_csv('../model/lgbm6_pred_avg.csv', index=False)\n\ncv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})\ncv_result['target'] = cv_result['target'].rank(pct=True)\ncv_result.to_csv('../model/lgbm6_cv_avg.csv', index=False)\n\n#cv score:\n#0.287007087138\n#current score: 0.289683837899 16\n"
  },
  {
    "path": "code_for_exact_solution/lightgbm7.py",
    "content": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing import normalize\nfrom scipy.stats import spearmanr\nimport lightgbm as lgb\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat, cat_count\nfrom sklearn.preprocessing import OneHotEncoder\nfrom scipy import sparse\n\ncv_only = True\nsave_cv = True\nfull_train = False\n\nimport numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nimport xgboost as xgb\nfrom sklearn.model_selection import StratifiedKFold, KFold\nimport numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\n\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds), True\n\npath = \"../input/\"\n\ntrain = pd.read_csv(path+'train.csv')\ntrain_label = train['target']\ntrain_id = train['id']\ntest = pd.read_csv(path+'test.csv')\ntest_id = test['id']\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\nlearning_rate = 0.1\nnum_leaves = 15\nmin_data_in_leaf = 2000\n# max_bin = x\nfeature_fraction = 0.6\nnum_boost_round = 10000\n\ny = 
train['target'].values\ndrop_feature = [\n    'id',\n    'target'\n]\n\nX = train.drop(drop_feature,axis=1)\nfeature_names = X.columns.tolist()\ncat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]\nnum_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\nnum_features.append('missing')\n\nfrom sklearn.preprocessing import OneHotEncoder,LabelEncoder\nfor c in cat_features:\n    le = LabelEncoder()\n    le.fit(train[c])\n    train[c] = le.transform(train[c])\n    test[c] = le.transform(test[c])\n\nfrom sklearn.preprocessing import normalize\nfrom scipy.stats import spearmanr\nfrom sklearn.preprocessing import OneHotEncoder\nenc = OneHotEncoder()\nenc.fit(train[cat_features])\nX_cat = enc.transform(train[cat_features])\nX_t_cat = enc.transform(test[cat_features])\n\nind_features = [c for c in feature_names if 'ind' in c]\ncount=0\nfor c in ind_features:\n    if count==0:\n        train['new_ind'] = train[c].astype(str)+'_'\n        test['new_ind'] = test[c].astype(str)+'_'\n        count+=1\n    else:\n        train['new_ind'] += train[c].astype(str)+'_'\n        test['new_ind'] += test[c].astype(str)+'_'\n\ncat_count_features = []\nfor c in cat_features+['new_ind']:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n    score = spearmanr(train['target'],train['%s_count'%c])\n    print(c,score)\n\ntrain_list = [train[num_features+cat_count_features].values,X_cat,]\ntest_list = [test[num_features+cat_count_features].values,X_t_cat,]\n\n# missing binary projections\n#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]\n#for miss_fea in missing_list:\n#    train['{}_miss_code'.format(miss_fea)] = 
(train[miss_fea] == -1).astype(int)\n#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)\n\nX = ssp.hstack(train_list).tocsr()\nX_test = ssp.hstack(test_list).tocsr()\n\n\nparams = {\"objective\": \"binary\",\n          \"boosting_type\": \"goss\",\n          \"learning_rate\": learning_rate,\n          \"num_leaves\": int(num_leaves),\n           \"max_bin\": 256,\n          \"feature_fraction\": feature_fraction,\n          \"verbosity\": 0,\n          \"drop_rate\": 0.1,\n          \"is_unbalance\": False,\n          \"max_drop\": 50,\n          \"min_child_samples\": 10,\n          \"min_child_weight\": 150,\n          \"min_split_gain\": 0,\n          \"subsample\": 0.9\n          }\n\nx_score = []\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfor s in xrange(10):\n    cv_train = np.zeros(len(train_label))\n    cv_pred = np.zeros(len(test_id))\n\n    params['seed'] = s\n\n    if cv_only:\n        kf = kfold.split(X, train_label)\n\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = lgbm.Dataset(X_train, label_train)\n            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)\n            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=100)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)\n            cv_train[validate] += bst.predict(X_validate)\n\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            fold_scores.append(score)\n\n        cv_pred /= NFOLDS\n        final_cv_train += cv_train\n        
final_cv_pred += cv_pred\n\n        print(\"cv score:\")\n        print Gini(train_label, cv_train)\n        print \"current score:\", Gini(train_label, final_cv_train / (s + 1.)), s+1\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n\n        x_score.append(Gini(train_label, cv_train))\n\nprint(x_score)\n\npred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})\npred_result['target'] = pred_result['target'].rank(pct=True)\npred_result.to_csv('../model/lgbm7_pred_avg.csv', index=False)\n\ncv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})\ncv_result['target'] = cv_result['target'].rank(pct=True)\ncv_result.to_csv('../model/lgbm7_cv_avg.csv', index=False)\n\n#cv score:\n#0.287007087138\n#current score: 0.289683837899 16\n"
  },
  {
    "path": "code_for_exact_solution/lightgbm8.py",
    "content": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nfrom scipy import sparse\nfrom scipy.stats import spearmanr\nimport pandas as pd\nimport lightgbm as lgb\nimport lightgbm as lgbm\nimport xgboost as xgb\nimport category_encoders as ce\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.preprocessing import normalize, LabelEncoder, OneHotEncoder\nfrom itertools import combinations\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\nlearning_rate = 0.1\nnum_leaves = 15\nmin_data_in_leaf = 2000\n# max_bin = x\nfeature_fraction = 0.6\nnum_boost_round = 10000\n\ncv_only = True\nsave_cv = True\nfull_train = True\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds), True\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = 
train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x for x in list(train) if 'bin' in x]\n\ntrain['missing'] = (train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\n\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\nnum_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\nnum_features.append('missing')\ninter_fea = [x for x in list(train) if 'inter' in x]\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\npath = \"../input/\"\nnum_features_comb = []\nimport os\nfor p in os.listdir(path):\n    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:\n        print(p)\n        x,xt = pd.read_pickle(path+p)\n        train[p] = x\n        test[p] = xt\n        num_features_comb.append(p)\n\nnum_features += num_features_comb\n\nfeature_names = list(train)\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        train['new_ind'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_ind'] += '_' + train[c].astype(str)\n\nprint(train['new_ind'].nunique())\nind_features = [c for c in feature_names if 'ind' in c]\ncount = 0\nfor c in ind_features:\n    if count == 0:\n        test['new_ind'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_ind'] += '_' + test[c].astype(str)\n\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        train['new_reg'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_reg'] += '_' + 
train[c].astype(str)\n\nprint(train['new_reg'].nunique())\nreg_features = [c for c in feature_names if 'reg' in c]\ncount = 0\nfor c in reg_features:\n    if count == 0:\n        test['new_reg'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_reg'] += '_' + test[c].astype(str)\n\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        train['new_car'] = train[c].astype(str)\n        count += 1\n    else:\n        train['new_car'] += '_' + train[c].astype(str)\n\nprint(train['new_car'].nunique())\ncar_features = [c for c in feature_names if 'car' in c]\ncount = 0\nfor c in car_features:\n    if count == 0:\n        test['new_car'] = test[c].astype(str)\n        count += 1\n    else:\n        test['new_car'] += '_' + test[c].astype(str)\n\nnew_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')\ntrain['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]\ntest['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]\nprint(train['ps_reg_03'].head(10))\n\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nohe = OneHotEncoder(sparse=True)\n\ntrain_cat = train[cat_fea].as_matrix()\ntrain_num = train[[x for x in list(train) if x in num_features]]\ntest_cat = test[cat_fea].as_matrix()\ntest_num = test[[x for x in list(train) if x in num_features]]\ntrain_cat[train_cat < 0] = 99\ntest_cat[test_cat < 0] = 99\n\ntraintest = np.vstack((train_cat, test_cat))\ntraintest = pd.DataFrame(traintest, columns=cat_fea)\nprint(traintest.shape)\n#encoder = ce.HelmertEncoder(cols=cat_fea)\n#encoder.fit(traintest)\n#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))\n#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))\nohe.fit(traintest)\ntrain_ohe = ohe.transform(train_cat)\ntest_ohe = ohe.transform(test_cat)\ndel traintest\n\ncat_count_features = []\nfor c in cat_fea + ['new_ind','new_reg','new_car']:\n    d = 
pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\n\nprint(train_num.dtypes)\ntrain_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]\n\n#proj\nfor t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:\n    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n\nparams = {\"objective\": \"binary\",\n          \"boosting_type\": \"gbdt\",\n          \"learning_rate\": learning_rate,\n          \"num_leaves\": int(num_leaves),\n           \"max_bin\": 256,\n          \"feature_fraction\": feature_fraction,\n          \"verbosity\": 0,\n          \"drop_rate\": 0.1,\n          \"is_unbalance\": False,\n          \"max_drop\": 50,\n          \"min_child_samples\": 10,\n          \"min_child_weight\": 150,\n          \"min_split_gain\": 0,\n          \"subsample\": 0.9\n          }\n\nx_score = []\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfor s in xrange(8):\n    cv_train = np.zeros(len(train_label))\n    cv_pred = np.zeros(len(test_id))\n\n    params['seed'] = s\n\n    if cv_only:\n        kf = kfold.split(X, train_label)\n\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, 
label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = lgbm.Dataset(X_train, label_train)\n            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)\n            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=100)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)\n            cv_train[validate] += bst.predict(X_validate)\n\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            fold_scores.append(score)\n\n        cv_pred /= NFOLDS\n        final_cv_train += cv_train\n        final_cv_pred += cv_pred\n\n        print(\"cv score:\")\n        print Gini(train_label, cv_train)\n        print \"current score:\", Gini(train_label, final_cv_train / (s + 1.)), s+1\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n\n        x_score.append(Gini(train_label, cv_train))\n\nprint(x_score)\n# average over the 8 seeds run in the loop above\npd.DataFrame({'id': test_id, 'target': final_cv_pred / 8.}).to_csv('../model/lgbm8_pred_avg.csv', index=False)\npd.DataFrame({'id': train_id, 'target': final_cv_train / 8.}).to_csv('../model/lgbm8_cv_avg.csv', index=False)\n\n#cv score:\n#0.287007087138\n#current score: 0.289683837899 16\n"
  },
  {
    "path": "code_for_exact_solution/logistic1.py",
    "content": "'''\nsimple logistic regression benchmark\n'''\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.feature_selection import SelectPercentile, f_classif\nimport numpy as np\nimport pandas as pd\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\nfrom sklearn.linear_model import LogisticRegression as LR\nfrom sklearn.preprocessing import StandardScaler\nfrom scipy.sparse import csr_matrix\nimport pickle\ncv_only = True\nsave_cv = True\nfull_train = True\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds)\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ntrain_copy = train.copy()\ntest_copy = test.copy()\ntrain_copy = train_copy.replace(-1, np.NaN)\ntest_copy = test_copy.replace(-1, np.NaN)\ntrain['num_na'] = train_copy.isnull().sum(axis=1)\ntest['num_na'] = test_copy.isnull().sum(axis=1)\ndel train_copy, test_copy\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x for x in list(train) if 'bin' in x]\ntrain_cat_count, test_cat_count = cat_count(train, test, cat_fea)\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\n\ninter_fea = [x for x in list(train) if 'inter' in x]\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\n\n#X = train.as_matrix()\n#X_test = 
test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nohe = OneHotEncoder(sparse=True)\n\ntrain_cat = train[cat_fea].as_matrix()\ntrain_num = train[[x for x in list(train) if x not in cat_fea]]\ntest_cat = test[cat_fea].as_matrix()\ntest_num = test[[x for x in list(train) if x not in cat_fea]]\ntrain_cat[train_cat < 0] = 99\ntest_cat[test_cat < 0] = 99\n\ntraintest = np.vstack((train_cat, test_cat))\ntraintest = pd.DataFrame(traintest, columns=cat_fea)\nprint(traintest.shape)\n#encoder = ce.HelmertEncoder(cols=cat_fea)\n#encoder.fit(traintest)\n#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))\n#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))\nohe.fit(traintest)\ntrain_ohe = ohe.transform(train_cat)\ntest_ohe = ohe.transform(test_cat)\ndel traintest\n\ntrain_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\"))\n\n\ntrain_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train_fea0, train_cat_count]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test_fea0, test_cat_count]#, np.ones(shape=(test_num.shape[0], 1))]\n\n#proj\nfor t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:\n    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\n\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n#X = train_num\n#X_test = test_num\n#all_data = np.vstack([X, X_test])\n\nselector = SelectPercentile(f_classif, 75)\nselector.fit(X.toarray(), train_label)\nX, X_test = selector.transform(X.toarray()), selector.transform(X_test.toarray())\n\nall_data = np.vstack([X, X_test])\nscaler = StandardScaler()\nscaler.fit(all_data)\n\nX = 
csr_matrix(scaler.transform(X))\nX_test = csr_matrix(scaler.transform(X_test))\n#X = scaler.transform(X)\n#X_test = scaler.transform(X_test)\n\nprint(X.shape, X_test.shape)\n\nkf = kfold.split(X, train_label)\ncv_train = np.zeros(len(train_label))\ncv_pred = np.zeros(len(test_id))\nbest_trees = []\nfold_scores = []\n\nif cv_only:\n    for i, (train_fold, validate) in enumerate(kf):\n        X_train, X_validate, label_train, label_validate = \\\n            X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n\n        #selector = SelectPercentile(f_classif, X_train.toarray(), label_train)\n        #X_train, X_validate = csr_matrix(selector.transform(X_train.toarray())), csr_matrix(selector.transform(\n        #    X_validate.toarray()\n        #))\n\n        clf = LR(C=25.)\n        clf.fit(X_train, label_train)\n        cv_pred += clf.predict_proba(X_test)[:, 1]\n        cv_train[validate] += clf.predict_proba(X_validate)[:, 1]\n        score = Gini(label_validate, cv_train[validate])\n        print score\n        fold_scores.append(score)\n\n    print(\"cv score:\")\n    print Gini(train_label, cv_train)\n    print(fold_scores)\n\n    if save_cv:\n        pd.DataFrame({'id': test_id, 'target': cv_pred/NFOLDS}).to_csv('../model/logistic1_pred.csv', index=False)\n        pd.DataFrame({'id': train_id, 'target': cv_train}).to_csv('../model/logistic1_cv.csv', index=False)\n\n#cv score:\n#0.27906438918\n#[0.28773918373071583, 0.26723600995806762, 0.28158062789785737, 0.27506435916773242, 0.28394519465855006]\n\nif full_train:\n    clf = LR(C=25.)\n    clf.fit(X, train_label)\n    pred = clf.predict_proba(X_test)[:, 1]\n    pd.DataFrame({'id': test_id, 'target': pred}).to_csv('../model/logistic1_full_pred.csv', index=False)\n\n\n"
  },
  {
    "path": "code_for_exact_solution/rank_average.py",
    "content": "import pandas as pd\nfrom util import Gini\n\ndef get_rank(x):\n    return pd.Series(x).rank(pct=True).values\n\ntrain = pd.read_csv(\"../input/train.csv\", usecols = ['target'])\n\nkeras3_train = pd.read_csv(\"../model/keras3_cv.csv\")\nkeras5_train = pd.read_csv(\"../model/keras5_cv.csv\")\nkeras6_train = pd.read_csv(\"../model/keras6_cv.csv\")\nkeras7_train = pd.read_csv(\"../model/keras7_cv.csv\")\n\nlgbm1_train = pd.read_csv(\"../model/lgbm1_cv_avg.csv\")\nlgbm3_train = pd.read_csv(\"../model/lgbm3_cv_avg.csv\")\nlgbm8_train = pd.read_csv(\"../model/lgbm8_cv_avg.csv\")\nlgbm5_train = pd.read_csv(\"../model/lgbm5_cv_avg.csv\")\nlgbm6_train = pd.read_csv(\"../model/lgbm6_cv_avg.csv\")\nlgbm7_train = pd.read_csv(\"../model/lgbm7_cv_avg.csv\")\n\n\nlogistic1_train = pd.read_csv(\"../model/logistic1_cv.csv\")\n\nxgb0_train = pd.read_csv(\"../model/xgb0_cv.csv\")\n\nkeras3_test = pd.read_csv(\"../model/keras3_pred.csv\")\nkeras5_test = pd.read_csv(\"../model/keras5_pred.csv\")\nkeras6_test = pd.read_csv(\"../model/keras6_pred.csv\")\nkeras7_test = pd.read_csv(\"../model/keras7_pred.csv\")\n\n\nlgbm1_test = pd.read_csv(\"../model/lgbm1_pred_avg.csv\")\nlgbm3_test = pd.read_csv(\"../model/lgbm3_pred_avg.csv\")\nlgbm8_test = pd.read_csv(\"../model/lgbm8_pred_avg.csv\")\nlgbm5_test = pd.read_csv(\"../model/lgbm5_pred_avg.csv\")\nlgbm6_test = pd.read_csv(\"../model/lgbm6_pred_avg.csv\")\nlgbm7_test = pd.read_csv(\"../model/lgbm7_pred_avg.csv\")\n\n\nlogistic1_test = pd.read_csv(\"../model/logistic1_pred.csv\")\n\nxgb0_test = pd.read_csv(\"../model/xgb0_pred.csv\")\n\nxgblinear_train = pd.read_csv(\"../model/xgb0l_cv.csv\")\nxgblinear_test = pd.read_csv(\"../model/xgb0l_pred.csv\")\n\n\nresult = get_rank(keras5_train['target']) * 0.4 + get_rank(lgbm3_train['target']) * 0.5 + \\\n         get_rank(xgb0_train['target']) * 0.1 + get_rank(lgbm1_train['target']) * (-0.1) + \\\n         get_rank(keras3_train['target']) * 0.1 + get_rank(logistic1_train['target']) 
* 0.1 + \\\n         get_rank(xgblinear_train['target']) * 0.1 + get_rank(lgbm8_train['target']) * 0.25 + \\\n         get_rank(lgbm5_train['target']) * 0.1 + \\\n         get_rank(lgbm6_train['target']) * (-0.1) + get_rank(lgbm7_train['target']) * (0.1) + \\\n         get_rank(keras6_train['target']) * (-0.1) + \\\n         get_rank(keras7_train['target']) * 0.3\n\nprint \"cv of final averaged model:\", Gini(train['target'], result)\n\nresult = get_rank(keras5_test['target']) * 0.4 + get_rank(lgbm3_test['target']) * 0.5 + \\\n    get_rank(xgb0_test['target']) * 0.1 + get_rank(lgbm1_test['target']) * (-0.1) + \\\n    get_rank(keras3_test['target']) * 0.1 + get_rank(logistic1_test['target']) * 0.1 + \\\n    get_rank(xgblinear_test['target']) * 0.1 + get_rank(lgbm8_test['target']) * 0.25 + \\\n    get_rank(lgbm5_test['target']) * 0.1 + \\\n    get_rank(lgbm6_test['target']) * (-0.1) + get_rank(lgbm7_test['target']) * (0.1) + \\\n    get_rank(keras6_test['target']) * (-0.1) + \\\n    get_rank(keras7_test['target']) * 0.3\n\npd.DataFrame({'id': keras5_test['id'], 'target': get_rank(result)}).to_csv(\"../model/all_average.csv\", index = False)"
  },
  {
    "path": "code_for_exact_solution/util.py",
    "content": "import numpy as np\nimport pandas as pd\n\ndef Gini(y_true, y_pred):\n    # check and get number of samples\n    assert y_true.shape == y_pred.shape\n    n_samples = y_true.shape[0]\n\n    # sort rows on prediction column\n    # (from largest to smallest)\n    arr = np.array([y_true, y_pred]).transpose()\n    true_order = arr[arr[:, 0].argsort()][::-1, 0]\n    pred_order = arr[arr[:, 1].argsort()][::-1, 0]\n\n    # get Lorenz curves\n    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)\n    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)\n    # use 1. / n_samples: under Python 2, 1 / n_samples is integer division and evaluates to 0\n    L_ones = np.linspace(1. / n_samples, 1, n_samples)\n\n    # get Gini coefficients (area between curves)\n    G_true = np.sum(L_ones - L_true)\n    G_pred = np.sum(L_ones - L_pred)\n\n    # normalize to true Gini coefficient\n    return G_pred * 1. / G_true\n\n\ndef cat_count(train_df, test_df, cat_list):\n    train_df['row_id'] = range(train_df.shape[0])\n    test_df['row_id'] = range(test_df.shape[0])\n    train_df['train'] = 1\n    test_df['train'] = 0\n    all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])\n    for e, cat in enumerate(cat_list):\n        grouped = all_df[[cat]].groupby(cat)\n        the_size = pd.DataFrame(grouped.size()).reset_index()\n        the_size.columns = [cat, '{}_size'.format(cat)]\n        all_df = pd.merge(all_df, the_size, how='left')\n\n        selected_train = all_df[all_df['train'] == 1]\n        selected_test = all_df[all_df['train'] == 0]\n        selected_train.sort_values('row_id', inplace=True)\n        selected_test.sort_values('row_id', inplace=True)\n        selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)\n        selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)\n\n        selected_train, selected_test = np.array(selected_train), np.array(selected_test)\n    print(selected_train.shape, selected_test.shape)\n    return selected_train, selected_test\n\n\ndef 
proj_num_on_cat(train_df, test_df, target_column, group_column):\n    \"\"\"\n    :param train_df: train data frame\n    :param test_df:  test data frame\n    :param target_column: name of numerical feature\n    :param group_column: name of categorical feature\n    \"\"\"\n    train_df['row_id'] = range(train_df.shape[0])\n    test_df['row_id'] = range(test_df.shape[0])\n    train_df['train'] = 1\n    test_df['train'] = 0\n    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',\n                                                                                        target_column, group_column]])\n    grouped = all_df[[target_column, group_column]].groupby(group_column)\n    the_size = pd.DataFrame(grouped.size()).reset_index()\n    the_size.columns = [group_column, '%s_size' % target_column]\n    the_mean = pd.DataFrame(grouped.mean()).reset_index()\n    the_mean.columns = [group_column, '%s_mean' % target_column]\n    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)\n    the_std.columns = [group_column, '%s_std' % target_column]\n    the_median = pd.DataFrame(grouped.median()).reset_index()\n    the_median.columns = [group_column, '%s_median' % target_column]\n    the_stats = pd.merge(the_size, the_mean)\n    the_stats = pd.merge(the_stats, the_std)\n    the_stats = pd.merge(the_stats, the_median)\n\n    the_max = pd.DataFrame(grouped.max()).reset_index()\n    the_max.columns = [group_column, '%s_max' % target_column]\n    the_min = pd.DataFrame(grouped.min()).reset_index()\n    the_min.columns = [group_column, '%s_min' % target_column]\n\n    the_stats = pd.merge(the_stats, the_max)\n    the_stats = pd.merge(the_stats, the_min)\n\n    all_df = pd.merge(all_df, the_stats, how='left')\n\n    selected_train = all_df[all_df['train'] == 1]\n    selected_test = all_df[all_df['train'] == 0]\n    selected_train.sort_values('row_id', inplace=True)\n    selected_test.sort_values('row_id', inplace=True)\n    
selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)\n    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)\n\n    selected_train, selected_test = np.array(selected_train), np.array(selected_test)\n    print(selected_train.shape, selected_test.shape)\n    return selected_train, selected_test\n\n\ndef interaction_features(train, test, fea1, fea2, prefix):\n    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]\n    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]\n\n    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]\n    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]\n\n    return train, test\n"
  },
  {
    "path": "code_for_exact_solution/xgb0.py",
    "content": "'''\nsimple xgboost benchmark\n'''\nimport xgboost as xgb\nfrom sklearn.model_selection import StratifiedKFold, KFold\nimport numpy as np\nimport pandas as pd\nfrom util import Gini\nfrom sklearn.preprocessing import LabelEncoder\nfrom itertools import combinations\nfrom util import proj_num_on_cat\nfrom sklearn.preprocessing import OneHotEncoder\nimport category_encoders as ce\nfrom scipy import sparse\n\ncv_only = True\nsave_cv = True\nfull_train = False\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds)\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n#kfold = KFold(n_splits=NFOLDS, shuffle=True, random_state=0)\neta = 0.05\nmax_depth = 7\nsubsample = 0.97\ncolsample_bytree = 0.85\ngamma = 0.05\nalpha = 0\nmin_child_weight = 55\n#lamb = 0.35\ncolsample_bylevel = 0.8\nnum_boost_round = 10000\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ntrain_copy = train.copy()\ntest_copy = test.copy()\ntrain_copy = train_copy.replace(-1, np.NaN)\ntest_copy = test_copy.replace(-1, np.NaN)\ntrain['num_na'] = train_copy.isnull().sum(axis=1)\ntest['num_na'] = test_copy.isnull().sum(axis=1)\ndel train_copy, test_copy\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x for x in list(train) if 'bin' in x]\n\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nohe = OneHotEncoder(sparse=True)\n\ncat_fea = [x for x in list(train) if 'cat' in x]\ntrain_cat = train[cat_fea].as_matrix()\ntrain_num = train[[x for x in list(train) if x not in cat_fea]]\ntest_cat = test[cat_fea].as_matrix()\ntest_num = test[[x for x in list(train) if x not in 
cat_fea]]\ntrain_cat[train_cat < 0] = 99\ntest_cat[test_cat < 0] = 99\n\ntraintest = np.vstack((train_cat, test_cat))\ntraintest = pd.DataFrame(traintest, columns=cat_fea)\nprint(traintest.shape)\n#encoder = ce.HelmertEncoder(cols=cat_fea)\n#encoder.fit(traintest)\n#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))\n#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))\nohe.fit(traintest)\ntrain_ohe = ohe.transform(train_cat)\ntest_ohe = ohe.transform(test_cat)\ndel traintest\n\ntrain_list = [train_num, train_ohe]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = [test_num, test_ohe]#, np.ones(shape=(test_num.shape[0], 1))]\n\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n#X, X_test = X.toarray(), X_test.toarray()\nprint(X.shape, X_test.shape)\n\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfinal_best_trees = []\n\nparams = {\"objective\": \"binary:logistic\",\n          \"booster\": \"gbtree\",\n          \"eta\": eta,\n          \"max_depth\": int(max_depth),\n          \"subsample\": subsample,\n          \"colsample_bytree\": colsample_bytree,\n          \"gamma\": gamma,\n          #\"lamb\": lamb,\n          \"alpha\": alpha,\n          \"min_child_weight\": min_child_weight,\n          \"colsample_bylevel\": colsample_bylevel,\n          \"silent\": 1\n          }\n\nif cv_only:\n    num_seeds = 24\n    for s in xrange(num_seeds):\n        print(s)\n        params['seed'] = s\n        kf = kfold.split(X, train_label)\n        cv_train = np.zeros(len(train_label))\n        cv_pred = np.zeros(len(test_id))\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = xgb.DMatrix(X_train, label_train)\n           
 dvalid = xgb.DMatrix(X_validate, label_validate)\n            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]\n            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=800,\n                            early_stopping_rounds=25, maximize=True)\n            best_trees.append(bst.best_iteration)\n            # predict on test with the same early-stopped tree count used for validation\n            cv_pred += bst.predict(xgb.DMatrix(X_test), ntree_limit=bst.best_ntree_limit)\n            cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)\n            score = Gini(label_validate, cv_train[validate])\n            print score\n            fold_scores.append(score)\n\n        final_cv_train += cv_train\n        final_cv_pred += cv_pred\n        final_best_trees += best_trees\n        print(\"cv score:\")\n        print Gini(train_label, cv_train)\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n        print(\"current score:\", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)\n\n\n    final_cv_pred /= (NFOLDS * num_seeds)\n    final_cv_train /= num_seeds\n    pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb_avg16_pred.csv', index=False)\n    pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb_avg16_cv.csv', index=False)\n    print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))\n\n    ## 0.1\n    #0.281739276885\n    #[0.28693135981084533, 0.26989064676756958, 0.28035898856108521, 0.28178381987103512, 0.29021910168396381]\n    #([123, 91, 139, 97, 92], 108.40000000000001)\n\n\n    #0.284350552387\n    #([1057, 933, 1175, 979, 1168], 1062.4000000000001)\n\nif full_train:\n    for s in xrange(32):\n        params['seed'] = s\n        dtrain = xgb.DMatrix(X, train_label)\n        watchlist = [(dtrain, 'train')]\n        bst = xgb.train(params, dtrain, 100, evals=watchlist, feval=evalerror, verbose_eval=50, maximize=True)\n        pred = bst.predict(xgb.DMatrix(X_test))\n        if s 
== 0:\n            final_pred = pred\n        else:\n            final_pred += pred\n    pd.DataFrame({'id': test_id, 'target': final_pred / 32.}).to_csv('../model/xgb_avg_full_pred.csv', index=False)\n\n\n"
  },
  {
    "path": "code_for_exact_solution/xgb_linear0.py",
    "content": "'''\nsimple xgboost benchmark\n'''\nimport os\nimport sys\nimport operator\nimport pickle\nimport datetime\nfrom time import time\nfrom itertools import combinations\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nfrom scipy.sparse import csr_matrix\nimport xgboost as xgb\nimport category_encoders as ce\nfrom sklearn import model_selection, preprocessing, ensemble\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.feature_selection import SelectPercentile, f_classif\nfrom sklearn.linear_model import LogisticRegression as LR\nfrom sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler\nfrom keras.models import Sequential\nfrom keras.layers import Dense, Dropout, Activation\nfrom keras.layers.normalization import BatchNormalization\nfrom keras.layers.advanced_activations import PReLU\nfrom util import Gini, proj_num_on_cat, interaction_features, cat_count\n\ncv_only = True\nsave_cv = True\nfull_train = True\n\ndef evalerror(preds, dtrain):\n    labels = dtrain.get_label()\n    return 'gini', Gini(labels, preds)\n\nNFOLDS = 5\nkfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n\ntrain = pd.read_csv(\"../input/train.csv\")\ntrain_label = train['target']\ntrain_id = train['id']\ndel train['target'], train['id']\n\ntest = pd.read_csv(\"../input/test.csv\")\ntest_id = test['id']\ndel test['id']\n\ncat_fea = [x for x in list(train) if 'cat' in x]\nbin_fea = [x for x in list(train) if 'bin' in x]\n\ntrain['missing'] = 
(train==-1).sum(axis=1).astype(float)\ntest['missing'] = (test==-1).sum(axis=1).astype(float)\n\n# include interactions\nfor e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):\n    train, test = interaction_features(train, test, x, y, e)\n\nnum_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\nnum_features.append('missing')\ninter_fea = [x for x in list(train) if 'inter' in x]\n#train['cat_sum'] = train[cat_fea].sum(axis=1)\n#test['cat_sum'] = test[cat_fea].sum(axis=1)\n\n\n#X = train.as_matrix()\n#X_test = test.as_matrix()\n#print(X.shape, X_test.shape)\n#ohe\nohe = OneHotEncoder(sparse=True)\n\ntrain_cat = train[cat_fea].as_matrix()\ntrain_num = train[[x for x in list(train) if x in num_features]]\ntest_cat = test[cat_fea].as_matrix()\ntest_num = test[[x for x in list(train) if x in num_features]]\ntrain_cat[train_cat < 0] = 99\ntest_cat[test_cat < 0] = 99\n\ntraintest = np.vstack((train_cat, test_cat))\ntraintest = pd.DataFrame(traintest, columns=cat_fea)\nprint(traintest.shape)\n#encoder = ce.HelmertEncoder(cols=cat_fea)\n#encoder.fit(traintest)\n#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))\n#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))\nohe.fit(traintest)\ntrain_ohe = ohe.transform(train_cat)\ntest_ohe = ohe.transform(test_cat)\ndel traintest\n\ntrain_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\"))\ntrain_fea1, test_fea1 = pickle.load(open(\"../input/fea0_lgb.pk\"))\n\ncat_count_features = []\nfor c in cat_fea:\n    d = pd.concat([train[c],test[c]]).value_counts().to_dict()\n    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))\n    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))\n    cat_count_features.append('%s_count'%c)\n\ntrain_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]\ntest_list = 
[test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]\n\n\nt_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\ng_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat'] + cat_fea\n\nt_fea = list(set(t_fea))\ng_fea = list(set(g_fea))\n\n#proj\nfor t in t_fea:\n    for g in g_fea:\n        if t != g:\n            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)\n            train_list.append(s_train)\n            test_list.append(s_test)\nX = sparse.hstack(train_list).tocsr()\nX_test = sparse.hstack(test_list).tocsr()\n#X = train_num\n#X_test = test_num\nall_data = np.vstack([X.toarray(), X_test.toarray()])\n#all_data = np.vstack([X, X_test])\n\nscaler = StandardScaler()\nscaler.fit(all_data)\nX = scaler.transform(X.toarray())\nX_test = scaler.transform(X_test.toarray())\n#X = scaler.transform(X)\n#X_test = scaler.transform(X_test)\n\nprint(X.shape, X_test.shape)\n\nfinal_cv_train = np.zeros(len(train_label))\nfinal_cv_pred = np.zeros(len(test_id))\nfinal_best_trees = []\n\neta = 0.1\nlamb = 0.25\nalpha = 1\nnum_boost_round = 10000\n\nparams = {\"objective\": \"binary:logistic\",\n          \"booster\": \"gbtree\",\n          \"eta\": eta,\n          \"lamb\": lamb,\n          \"alpha\": alpha,\n          \"silent\": 1\n          }\n\nif cv_only:\n    num_seeds = 3\n    for s in xrange(num_seeds):\n        print(s)\n        params['seed'] = s\n        kf = kfold.split(X, train_label)\n        cv_train = np.zeros(len(train_label))\n        cv_pred = np.zeros(len(test_id))\n        best_trees = []\n        fold_scores = []\n\n        for i, (train_fold, validate) in enumerate(kf):\n            X_train, X_validate, label_train, label_validate = \\\n                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]\n            dtrain = xgb.DMatrix(X_train, 
label_train)\n            dvalid = xgb.DMatrix(X_validate, label_validate)\n            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]\n            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=100,\n                            early_stopping_rounds=50, maximize=True)\n            best_trees.append(bst.best_iteration)\n            cv_pred += bst.predict(xgb.DMatrix(X_test))\n            cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)\n            score = Gini(label_validate, cv_train[validate])\n            print(score)\n            fold_scores.append(score)\n\n        final_cv_train += cv_train\n        final_cv_pred += cv_pred\n        final_best_trees += best_trees\n        print(\"cv score:\")\n        print(Gini(train_label, cv_train))\n        print(fold_scores)\n        print(best_trees, np.mean(best_trees))\n        print(\"current score:\", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)\n\n\n    final_cv_pred /= (NFOLDS * num_seeds)\n    final_cv_train /= num_seeds\n    pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb0l_pred.csv', index=False)\n    pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb0l_cv.csv', index=False)\n    print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))"
  },
  {
    "path": "input/readme",
    "content": "unzipped data goes here"
  },
  {
    "path": "model/readme",
    "content": "models will be saved here"
  },
  {
    "path": "readme.md",
    "content": "### Requirements\n\n*older or newer version of below packages should theoretically work fine*\n\npython 2.7\n\nnumpy 1.13.3\n\npandas 0.20.3\n\nsklearn 0.19.1\n\nkeras 2.1.1\n\ntensorflow 1.4.0\n\nxgboost 0.6\n\nlightgbm 2.0.10\n\n\n### How to reproduce\n\n#### Simple solution (recommended)\nPut unzipped data in `input`\n\n*Generate a simple solution that is good enough for 2nd place (~0.2938 on private LB)*\n\n`cd code`\n\n`python fea_eng0.py`\n\n`python nn_model290.py` to get a nn model that scores 0.290X\n\n`python gbm_model291.py` to get a gbm model that scores 0.291X\n\n`python simple_average.py` and then you can find the submission file at `../model/simple_average.csv`\n\nYou can reproduce this solution in a few hours.\n\n#### Exact solution (Optional)\n\nAlthough not recommended but you can also reproduce the exact same solution we submitted (0.29413 on private LB).\n\n*you can follow these steps below, in addition to the simple solution*\n\n```\ncd ../code_for_exact_solution/\npython keras3.py\npython keras6.py\npython keras7.py\npython lightgbm1.py\npython lightgbm5.py\npython lightgbm6.py\npython lightgbm7.py\npython lightgbm8.py\npython logistic1.py\npython xgb0.py\npython xgb_linear0.py\npython rank_average.py\n```\n\n*It can take up to 2 days to generate the exact solution which only has 0.0003 improvement over the simple one*"
  }
]