Repository: xiaozhouwang/kaggle-porto-seguro
Branch: master
Commit: 83d794f6dce6
Files: 26
Total size: 158.5 KB

Directory structure:
gitextract_dcpwav_6/

├── Jupyter_nnmodel/
│   ├── README.md
│   ├── fea_eng0.py
│   ├── feature_generater.py
│   ├── nn_model .ipynb
│   └── util.py
├── code/
│   ├── fea_eng0.py
│   ├── gbm_model291.py
│   ├── nn_model290.py
│   ├── simple_average.py
│   └── util.py
├── code_for_exact_solution/
│   ├── keras3.py
│   ├── keras6.py
│   ├── keras7.py
│   ├── lightgbm1.py
│   ├── lightgbm5.py
│   ├── lightgbm6.py
│   ├── lightgbm7.py
│   ├── lightgbm8.py
│   ├── logistic1.py
│   ├── rank_average.py
│   ├── util.py
│   ├── xgb0.py
│   └── xgb_linear0.py
├── input/
│   └── readme
├── model/
│   └── readme
└── readme.md

================================================
FILE CONTENTS
================================================

================================================
FILE: Jupyter_nnmodel/README.md
================================================
# Content: Jupyter Version of 2nd place code kaggle-porto-seguro
## [Porto Seguro’s Safe Driver Prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) in Kaggle Competition

### Purpose
The purpose of this Jupyter notebook is to make the flow of the code readable and easy to understand. Some code has been changed from the original author's version, but the approach of the 2nd place solution is exactly the same. Some of the feature engineering code has been moved into `feature_generater.py`.

### Install

This project requires **Python 2.7 or 3.6** and the following Python libraries installed:

- [NumPy](http://www.numpy.org/)
- [Pandas](http://pandas.pydata.org)
- [TensorFlow](https://www.tensorflow.org/) (Keras backend)
- [scikit-learn](http://scikit-learn.org/stable/)
- [xgboost](https://xgboost.readthedocs.io/)
- [pickle](https://docs.python.org/3/library/pickle.html) (standard library)
- [keras](https://keras.io/)
- [itertools](https://docs.python.org/2/library/itertools.html)

You will also need to have software installed to run and execute a [Jupyter Notebook](http://ipython.org/notebook.html).

If you do not have Python installed yet, it is highly recommended that you install the [Anaconda](http://continuum.io/downloads) distribution of Python, which already includes most of the above packages; xgboost and Keras may need to be installed separately. Either the Python 2.7 or the Python 3.6 installer can be used; with Python 3, see the note under Code below.

### Code
1. Run `fea_eng0.py` to get your first features as a pickle file `fea0.pk`.

Note: if you are using Python 3, switch the pickle dump at the end of `fea_eng0.py` from the `protocol=2` line to the commented-out `protocol=3` line, so that the pickle file can be read in `nn_model.ipynb`.

2. The main code is provided in the `nn_model.ipynb` notebook file. You will also need the included `util.py` and `feature_generater.py` Python files, and the `train.csv` and `test.csv` dataset files, which you have to download from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) into the `input` folder. While some code has already been implemented to get you started, you may need to implement additional functionality when requested. When `nn_model.ipynb` runs, its default output is created in the `model` folder. If you are interested in `util.py` and `feature_generater.py`, please feel free to explore these Python files.
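
The pickle hand-off between `fea_eng0.py` and the notebook can be sketched as follows. This is a minimal illustration, not the notebook's own loading code: the array shapes are made up, the file is written to a temporary directory instead of `../input/`, and the read side is an assumption about how the file is consumed.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins for the arrays that fea_eng0.py writes; the shapes are made up.
train_features = np.zeros((4, 3))
test_features = np.ones((2, 3))

# A temporary location is used here instead of ../input/fea0.pk.
path = os.path.join(tempfile.mkdtemp(), "fea0.pk")

# protocol=2 is the highest pickle protocol that Python 2.7 supports.
with open(path, "wb") as f:
    pickle.dump([train_features, test_features], f, protocol=2)

# Reading the list back returns the two arrays in the order they were dumped.
with open(path, "rb") as f:
    loaded_train, loaded_test = pickle.load(f)

print(loaded_train.shape, loaded_test.shape)
```

Opening the file in binary mode (`"wb"` / `"rb"`) matters on both Python versions, since pickle data is bytes rather than text.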


### Run

In a terminal or command window, navigate to the project directory `Jupyter_nnmodel/` (that contains this README) and run one of the following commands:

```bash
ipython notebook nn_model.ipynb
```  
or
```bash
jupyter notebook nn_model.ipynb
```

This will open the Jupyter Notebook software and project file in your browser.

## Data

You can download the data from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data).
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
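
The naming convention described above can be used directly in code. The sketch below runs on a toy frame with made-up values; `column_kind` is a hypothetical helper that is not part of this repository. Counting the `-1` entries per row mirrors the `missing` column built in the notebook.

```python
import pandas as pd

# Toy frame that follows the competition's naming scheme; the values are made up.
df = pd.DataFrame({
    "ps_ind_01": [2, 1, 5],
    "ps_ind_02_cat": [2, -1, 4],
    "ps_ind_06_bin": [0, 1, 0],
    "ps_reg_03": [0.718, -1.0, 0.766],
    "ps_car_13": [0.884, 0.619, 0.642],
})


def column_kind(name):
    """Classify a column by its name postfix (hypothetical helper)."""
    if name.endswith("_bin"):
        return "binary"
    if name.endswith("_cat"):
        return "categorical"
    return "continuous_or_ordinal"


# Kind of each column, taken from the postfix.
kinds = {c: column_kind(c) for c in df.columns}

# Grouping tag (ind, reg, car or calc) is the second part of the name.
groups = {c: c.split("_")[1] for c in df.columns}

# -1 marks a missing value, so count those entries in each row.
missing_per_row = (df == -1).sum(axis=1)

print(kinds)
print(groups)
print(missing_per_row.tolist())
```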




================================================
FILE: Jupyter_nnmodel/fea_eng0.py
================================================
"""xgb prediction as features"""
import xgboost as xgb
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

eta = 0.1
max_depth = 6
subsample = 0.9
colsample_bytree = 0.85
min_child_weight = 55
num_boost_round = 500

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']


params = {"objective": "reg:linear",
          "booster": "gbtree",
          "eta": eta,
          "max_depth": int(max_depth),
          "subsample": subsample,
          "colsample_bytree": colsample_bytree,
          "min_child_weight": min_child_weight,
          "silent": 1
          }

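# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions.
data = pd.concat([train, test])
# Note: without drop=True the old index is kept as an extra 'index' column,
# which the loops below treat like any other column.
data.reset_index(inplace=True)
train_rows = train.shape[0]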

feature_results = []

for target_g in ['car', 'ind', 'reg']:
    features = [x for x in list(data) if target_g not in x]
    target_list = [x for x in list(data) if target_g in x]
    train_fea = np.array(data[features])
    for target in target_list:
        print(target)
        train_label = data[target]
        kfold = KFold(n_splits=5, random_state=218, shuffle=True)
        kf = kfold.split(data)
        cv_train = np.zeros(shape=(data.shape[0], 1))
        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,
                            early_stopping_rounds=10)
            cv_train[validate, 0] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
        feature_results.append(cv_train)

feature_results = np.hstack(feature_results)
train_features = feature_results[:train_rows, :]
test_features = feature_results[train_rows:, :]

import pickle
# for Python 2
with open("../input/fea0.pk", 'wb') as f:
    pickle.dump([train_features, test_features], f, protocol=2)
# for Python 3
# with open("../input/fea0.pk", 'wb') as f:
#     pickle.dump([train_features, test_features], f, protocol=3)



================================================
FILE: Jupyter_nnmodel/feature_generater.py
================================================
from util import proj_num_on_cat, Gini, interaction_features
from itertools import combinations
import numpy as np
import pandas as pd

def Multiply_Divide(train, test, features):
    """
    Add interaction features for every pair of columns in `features`.

    Pairs come from itertools.combinations:
    combinations(['A', 'B', 'C'], 2) returns AB AC BC
    combinations(range(4), 3) --> 012 013 023 123
    """
    feature_names= []
    for e, (x, y) in enumerate(combinations(features, 2)):
        train, test, feature_name= interaction_features(train, test, x, y, e)
        for name in feature_name:
            feature_names.append(name)


    return train, test, feature_names




def Series_string(train, test, category_list):
    '''
    Join the columns of each category into one underscore-separated string
    column, e.g. new_ind as in the following:

        id        new_ind_count                    new_ind
    595207            117       3_1_10_0_0_0_0_0_1_0_0_0_0_0_13_1_0_0
    595208            153       5_1_3_0_0_0_0_0_1_0_0_0_0_0_6_1_0_0

    Returns train and test with a new column new_<category> for each category.
    '''
    for category in category_list:

        feature_names = list(train.columns)

        features = [c for c in feature_names if category in c]
        name= 'new_'+ category


        count = 0
        for c in features:
            if count == 0:
                train[name] = train[c].astype(str)
                count += 1
            else:
                train[name] += '_' + train[c].astype(str)

        count = 0
        for c in features:
            if count == 0:
                test[name] = test[c].astype(str)
                count += 1
            else:
                test[name] += '_' + test[c].astype(str)

    return train, test



def Features_Counts(train, test, features):
    """
    For each column c in `features`, add a column '<c>_count' holding how often
    the row's value occurs in train and test combined.
    """
    feature_names =[]

    for c in features:
        d = pd.concat([train[c],test[c]]).value_counts().to_dict()
        train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
        test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
        feature_names.append('%s_count'%c)

    return train, test, feature_names


def Statistic_features(train, test, target_features, group_features):
    train_list_=[]
    test_list_=[]
    for t in target_features:
        for g in group_features:
            if t != g:
                s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
                train_list_.append(s_train)
                test_list_.append(s_test)
    return np.hstack(train_list_), np.hstack(test_list_)

def features_type(train):
    data = []
    for f in train.columns:
        
        # Defining the level
        if 'bin' in f or f == 'target':
            level = 'binary'
        elif 'cat' in f or f == 'id':
            level = 'nominal'
        elif train[f].dtype == float:
            level = 'interval'
        elif train[f].dtype == int:
            level = 'ordinal'
        else:
            # e.g. string columns; avoids reusing the previous column's level
            level = 'other'
            
        # Initialize keep to True for all variables except for id
        keep = True
        if f == 'id':
            keep = False
        
        # Defining the data type 
        dtype = train[f].dtype
        
        # Creating a Dict that contains all the metadata for the variable
        f_dict = {
            'varname': f,
            'level': level,
            'keep': keep,
            'dtype': dtype
        }
        data.append(f_dict)
    meta = pd.DataFrame(data, columns=['varname', 'level', 'keep', 'dtype'])
    meta.set_index('varname', inplace=True)
    interval = meta[(meta.level == 'interval') & (meta.keep)].index
    ordinal = meta[(meta.level == 'ordinal') & (meta.keep)].index
    binary = meta[(meta.level == 'binary') & (meta.keep)].index
    nominal  = meta[(meta.level == 'nominal') & (meta.keep)].index
    return interval, ordinal, binary, nominal






================================================
FILE: Jupyter_nnmodel/nn_model .ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/stevenhu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "#homemade script\n",
    "from util import Gini\n",
    "from feature_generater import Multiply_Divide, Series_string, Features_Counts, Statistic_features\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pickle\n",
    "from scipy import sparse\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "#for NN model\n",
    "from keras.layers import Dense, Dropout, Embedding, Flatten, Input, Concatenate, merge\n",
    "from keras.layers.normalization import BatchNormalization\n",
    "from keras.layers.advanced_activations import PReLU\n",
    "from keras.models import Model\n",
    "from time import time\n",
    "import datetime\n",
    "from sklearn.model_selection import StratifiedKFold\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Load Data #\n",
    "\n",
    "- Create train, test dataset\n",
    "- Create train target label\n",
    "- Create feature object: cat, num, bin, inter\n",
    "- Create feature columns in train: count of missing values\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_only = True\n",
    "save_cv = True\n",
    "\n",
    "#read data\n",
    "train = pd.read_csv(\"../input/train.csv\")\n",
    "train_label = train['target']\n",
    "train_id = train['id']\n",
    "del train['target'], train['id']\n",
    "\n",
    "test = pd.read_csv(\"../input/test.csv\")\n",
    "test_id = test['id']\n",
    "del test['id']\n",
    "\n",
    "\n",
    "\n",
    "#count the missing values (-1) in each row and record them in column 'missing'\n",
    "train['missing'] = (train==-1).sum(axis=1).astype(float)\n",
    "test['missing'] = (test==-1).sum(axis=1).astype(float)\n",
    "\n",
    "#get all feature names\n",
    "feature_names = list(train)\n",
    "\n",
    "# extract feature with cat, bin, num, inter\n",
    "cat_fea = [x for x in list(train) if 'cat' in x]\n",
    "bin_fea = [x for x in list(train) if 'bin' in x]\n",
    "num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\n",
    "inter_fea = [x for x in list(train) if 'inter' in x]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Feature Engineering #\n",
    "\n",
    "- Create Multiply and Divide features\n",
    "- Feature of Counts (count encoding)\n",
    "- Load feature generated from Feature Engine\n",
    "- Create Statistic features\n",
    "- Combine all features together, and get ready for training\n",
    "- Create Cat_feature for NN embedding training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 Multiply and Divide feature ##\n",
    "- multiply and divide each pair of features in the list, adding the results as new columns to the train and test data sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Add features of Multiply and Divide\n",
    "features= ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\n",
    "train, test, MD_features = Multiply_Divide(train, test, features)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>inter_0*</th>\n",
       "      <th>inter_0/</th>\n",
       "      <th>inter_1*</th>\n",
       "      <th>inter_1/</th>\n",
       "      <th>inter_2*</th>\n",
       "      <th>inter_2/</th>\n",
       "      <th>inter_3*</th>\n",
       "      <th>inter_3/</th>\n",
       "      <th>inter_4*</th>\n",
       "      <th>inter_4/</th>\n",
       "      <th>...</th>\n",
       "      <th>inter_10*</th>\n",
       "      <th>inter_10/</th>\n",
       "      <th>inter_11*</th>\n",
       "      <th>inter_11/</th>\n",
       "      <th>inter_12*</th>\n",
       "      <th>inter_12/</th>\n",
       "      <th>inter_13*</th>\n",
       "      <th>inter_13/</th>\n",
       "      <th>inter_14*</th>\n",
       "      <th>inter_14/</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4.418395</td>\n",
       "      <td>0.176736</td>\n",
       "      <td>0.634544</td>\n",
       "      <td>1.230630</td>\n",
       "      <td>9.720468</td>\n",
       "      <td>0.080334</td>\n",
       "      <td>0.618575</td>\n",
       "      <td>1.262398</td>\n",
       "      <td>1.767358</td>\n",
       "      <td>0.441839</td>\n",
       "      <td>...</td>\n",
       "      <td>0.502649</td>\n",
       "      <td>1.025815</td>\n",
       "      <td>1.436141</td>\n",
       "      <td>0.359035</td>\n",
       "      <td>7.7</td>\n",
       "      <td>15.714286</td>\n",
       "      <td>22</td>\n",
       "      <td>5.500000</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.350000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.331716</td>\n",
       "      <td>0.088402</td>\n",
       "      <td>0.474062</td>\n",
       "      <td>0.807773</td>\n",
       "      <td>1.856450</td>\n",
       "      <td>0.206272</td>\n",
       "      <td>0.495053</td>\n",
       "      <td>0.773521</td>\n",
       "      <td>0.618817</td>\n",
       "      <td>0.618817</td>\n",
       "      <td>...</td>\n",
       "      <td>0.612862</td>\n",
       "      <td>0.957597</td>\n",
       "      <td>0.766078</td>\n",
       "      <td>0.766078</td>\n",
       "      <td>2.4</td>\n",
       "      <td>3.750000</td>\n",
       "      <td>3</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.8</td>\n",
       "      <td>0.800000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.774271</td>\n",
       "      <td>0.071287</td>\n",
       "      <td>-0.641586</td>\n",
       "      <td>-0.641586</td>\n",
       "      <td>7.699029</td>\n",
       "      <td>0.053465</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>inf</td>\n",
       "      <td>3.207929</td>\n",
       "      <td>0.128317</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.000000</td>\n",
       "      <td>-inf</td>\n",
       "      <td>-5.000000</td>\n",
       "      <td>-0.200000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>inf</td>\n",
       "      <td>60</td>\n",
       "      <td>2.400000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.085898</td>\n",
       "      <td>0.271474</td>\n",
       "      <td>0.315425</td>\n",
       "      <td>0.934592</td>\n",
       "      <td>4.343590</td>\n",
       "      <td>0.067869</td>\n",
       "      <td>0.488654</td>\n",
       "      <td>0.603276</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>inf</td>\n",
       "      <td>...</td>\n",
       "      <td>0.522853</td>\n",
       "      <td>0.645497</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>inf</td>\n",
       "      <td>7.2</td>\n",
       "      <td>8.888889</td>\n",
       "      <td>0</td>\n",
       "      <td>inf</td>\n",
       "      <td>0.0</td>\n",
       "      <td>inf</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>inf</td>\n",
       "      <td>0.475728</td>\n",
       "      <td>0.673001</td>\n",
       "      <td>5.092484</td>\n",
       "      <td>0.062870</td>\n",
       "      <td>0.396082</td>\n",
       "      <td>0.808331</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>inf</td>\n",
       "      <td>...</td>\n",
       "      <td>0.588531</td>\n",
       "      <td>1.201084</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>inf</td>\n",
       "      <td>6.3</td>\n",
       "      <td>12.857143</td>\n",
       "      <td>0</td>\n",
       "      <td>inf</td>\n",
       "      <td>0.0</td>\n",
       "      <td>inf</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 30 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   inter_0*  inter_0/  inter_1*  inter_1/  inter_2*  inter_2/  inter_3*  \\\n",
       "0  4.418395  0.176736  0.634544  1.230630  9.720468  0.080334  0.618575   \n",
       "1  4.331716  0.088402  0.474062  0.807773  1.856450  0.206272  0.495053   \n",
       "2  5.774271  0.071287 -0.641586 -0.641586  7.699029  0.053465  0.000000   \n",
       "3  1.085898  0.271474  0.315425  0.934592  4.343590  0.067869  0.488654   \n",
       "4  0.000000       inf  0.475728  0.673001  5.092484  0.062870  0.396082   \n",
       "\n",
       "   inter_3/  inter_4*  inter_4/    ...      inter_10*  inter_10/  inter_11*  \\\n",
       "0  1.262398  1.767358  0.441839    ...       0.502649   1.025815   1.436141   \n",
       "1  0.773521  0.618817  0.618817    ...       0.612862   0.957597   0.766078   \n",
       "2       inf  3.207929  0.128317    ...      -0.000000       -inf  -5.000000   \n",
       "3  0.603276  0.000000       inf    ...       0.522853   0.645497   0.000000   \n",
       "4  0.808331  0.000000       inf    ...       0.588531   1.201084   0.000000   \n",
       "\n",
       "   inter_11/  inter_12*  inter_12/  inter_13*  inter_13/  inter_14*  inter_14/  \n",
       "0   0.359035        7.7  15.714286         22   5.500000        1.4   0.350000  \n",
       "1   0.766078        2.4   3.750000          3   3.000000        0.8   0.800000  \n",
       "2  -0.200000        0.0        inf         60   2.400000        0.0   0.000000  \n",
       "3        inf        7.2   8.888889          0        inf        0.0        inf  \n",
       "4        inf        6.3  12.857143          0        inf        0.0        inf  \n",
       "\n",
       "[5 rows x 30 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(train[MD_features].head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 Feature of Counts ##\n",
    "1. Generate new_ind, new_reg, new_car\n",
    "2. Count how often each value occurs for the\n",
    "    cat features, new_ind, new_reg and new_car"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "'''\n",
    "create 1_0_1_1..... strings as new_xxx\n",
    "\n",
    "new_ind: join the values of all \"ind\" columns into one underscore-separated string\n",
    "\n",
    "new_reg, new_car: built the same way, for both train and test data\n",
    "For RNN processing, generating a sequence number\n",
    "'''\n",
    "\n",
    "\n",
    "category_list = ['ind', 'reg', 'car']\n",
    "#add 'new_ind','new_reg','new_car' in train and test dataset\n",
    "train, test = Series_string(train,test,category_list )\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>new_ind</th>\n",
       "      <th>new_reg</th>\n",
       "      <th>new_car</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0</td>\n",
       "      <td>0.7_0.2_0.7180703308</td>\n",
       "      <td>10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1</td>\n",
       "      <td>0.8_0.4_0.7660776723</td>\n",
       "      <td>11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0</td>\n",
       "      <td>0.0_0.0_-1.0</td>\n",
       "      <td>7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0</td>\n",
       "      <td>0.9_0.2_0.5809475019</td>\n",
       "      <td>7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0</td>\n",
       "      <td>0.7_0.6_0.840758586</td>\n",
       "      <td>11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                new_ind               new_reg  \\\n",
       "0  2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0  0.7_0.2_0.7180703308   \n",
       "1   1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1  0.8_0.4_0.7660776723   \n",
       "2  5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0          0.0_0.0_-1.0   \n",
       "3   0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0  0.9_0.2_0.5809475019   \n",
       "4   0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0   0.7_0.6_0.840758586   \n",
       "\n",
       "                                             new_car  \n",
       "0  10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0....  \n",
       "1  11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618...  \n",
       "2  7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415...  \n",
       "3  7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429...  \n",
       "4  11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(train[['new_ind','new_reg','new_car']].head(5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "count_features\n",
    "\n",
    "preparing for train[cat_count_features] \n",
    "cat_fea = \n",
    "['ps_ind_02_cat','ps_ind_04_cat','ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat',\n",
    " 'ps_car_03_cat','ps_car_04_cat','ps_car_05_cat','ps_car_06_cat','ps_car_07_cat',\n",
    " 'ps_car_08_cat','ps_car_09_cat','ps_car_10_cat', 'ps_car_11_cat']\n",
    "\n",
    "Example: \n",
    "ps_ind_02_cat_count\n",
    "dictionary of ps_ind_02_cat \n",
    "([(1, 1079327), (2, 309747), (3, 70172), (4, 28259), (-1, 523)])\n",
    "\n",
    "row        count     original value\n",
    "595202    1079327       1     \n",
    "595203     309747       2\n",
    "595204     309747       2\n",
    "595205      70172       3\n",
    "595206    1079327       1\n",
    "\n",
    "''' \n",
    "\n",
    "cat_fea = [ name for name in list(train) if 'cat' in name and 'count' not in name]\n",
    "features= cat_fea + ['new_ind','new_reg','new_car']\n",
    "\n",
    "train, test, cat_count_features= Features_Counts(train, test, features)\n",
    "\n",
    "\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ps_ind_02_cat_count</th>\n",
       "      <th>ps_ind_04_cat_count</th>\n",
       "      <th>ps_ind_05_cat_count</th>\n",
       "      <th>ps_car_01_cat_count</th>\n",
       "      <th>ps_car_02_cat_count</th>\n",
       "      <th>ps_car_03_cat_count</th>\n",
       "      <th>ps_car_04_cat_count</th>\n",
       "      <th>ps_car_05_cat_count</th>\n",
       "      <th>ps_car_06_cat_count</th>\n",
       "      <th>ps_car_07_cat_count</th>\n",
       "      <th>ps_car_08_cat_count</th>\n",
       "      <th>ps_car_09_cat_count</th>\n",
       "      <th>ps_car_10_cat_count</th>\n",
       "      <th>ps_car_11_cat_count</th>\n",
       "      <th>new_ind_count</th>\n",
       "      <th>new_reg_count</th>\n",
       "      <th>new_car_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>309747</td>\n",
       "      <td>620936</td>\n",
       "      <td>1319412</td>\n",
       "      <td>124587</td>\n",
       "      <td>1234979</td>\n",
       "      <td>1028142</td>\n",
       "      <td>1241334</td>\n",
       "      <td>431560</td>\n",
       "      <td>77845</td>\n",
       "      <td>1383070</td>\n",
       "      <td>249663</td>\n",
       "      <td>486510</td>\n",
       "      <td>1475460</td>\n",
       "      <td>18326</td>\n",
       "      <td>6</td>\n",
       "      <td>24</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1079327</td>\n",
       "      <td>866864</td>\n",
       "      <td>1319412</td>\n",
       "      <td>518725</td>\n",
       "      <td>1234979</td>\n",
       "      <td>1028142</td>\n",
       "      <td>1241334</td>\n",
       "      <td>666910</td>\n",
       "      <td>329890</td>\n",
       "      <td>1383070</td>\n",
       "      <td>1238365</td>\n",
       "      <td>883326</td>\n",
       "      <td>1475460</td>\n",
       "      <td>12535</td>\n",
       "      <td>36</td>\n",
       "      <td>38</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>28259</td>\n",
       "      <td>620936</td>\n",
       "      <td>1319412</td>\n",
       "      <td>449617</td>\n",
       "      <td>1234979</td>\n",
       "      <td>1028142</td>\n",
       "      <td>1241334</td>\n",
       "      <td>666910</td>\n",
       "      <td>147714</td>\n",
       "      <td>1383070</td>\n",
       "      <td>1238365</td>\n",
       "      <td>883326</td>\n",
       "      <td>1475460</td>\n",
       "      <td>19943</td>\n",
       "      <td>24</td>\n",
       "      <td>13477</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1079327</td>\n",
       "      <td>866864</td>\n",
       "      <td>1319412</td>\n",
       "      <td>449617</td>\n",
       "      <td>1234979</td>\n",
       "      <td>183044</td>\n",
       "      <td>1241334</td>\n",
       "      <td>431560</td>\n",
       "      <td>329890</td>\n",
       "      <td>1383070</td>\n",
       "      <td>1238365</td>\n",
       "      <td>36798</td>\n",
       "      <td>1475460</td>\n",
       "      <td>212989</td>\n",
       "      <td>2784</td>\n",
       "      <td>222</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>309747</td>\n",
       "      <td>620936</td>\n",
       "      <td>1319412</td>\n",
       "      <td>518725</td>\n",
       "      <td>1234979</td>\n",
       "      <td>1028142</td>\n",
       "      <td>1241334</td>\n",
       "      <td>666910</td>\n",
       "      <td>147714</td>\n",
       "      <td>1383070</td>\n",
       "      <td>1238365</td>\n",
       "      <td>883326</td>\n",
       "      <td>1475460</td>\n",
       "      <td>26161</td>\n",
       "      <td>258</td>\n",
       "      <td>34</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ps_ind_02_cat_count  ps_ind_04_cat_count  ps_ind_05_cat_count  \\\n",
       "0               309747               620936              1319412   \n",
       "1              1079327               866864              1319412   \n",
       "2                28259               620936              1319412   \n",
       "3              1079327               866864              1319412   \n",
       "4               309747               620936              1319412   \n",
       "\n",
       "   ps_car_01_cat_count  ps_car_02_cat_count  ps_car_03_cat_count  \\\n",
       "0               124587              1234979              1028142   \n",
       "1               518725              1234979              1028142   \n",
       "2               449617              1234979              1028142   \n",
       "3               449617              1234979               183044   \n",
       "4               518725              1234979              1028142   \n",
       "\n",
       "   ps_car_04_cat_count  ps_car_05_cat_count  ps_car_06_cat_count  \\\n",
       "0              1241334               431560                77845   \n",
       "1              1241334               666910               329890   \n",
       "2              1241334               666910               147714   \n",
       "3              1241334               431560               329890   \n",
       "4              1241334               666910               147714   \n",
       "\n",
       "   ps_car_07_cat_count  ps_car_08_cat_count  ps_car_09_cat_count  \\\n",
       "0              1383070               249663               486510   \n",
       "1              1383070              1238365               883326   \n",
       "2              1383070              1238365               883326   \n",
       "3              1383070              1238365                36798   \n",
       "4              1383070              1238365               883326   \n",
       "\n",
       "   ps_car_10_cat_count  ps_car_11_cat_count  new_ind_count  new_reg_count  \\\n",
       "0              1475460                18326              6             24   \n",
       "1              1475460                12535             36             38   \n",
       "2              1475460                19943             24          13477   \n",
       "3              1475460               212989           2784            222   \n",
       "4              1475460                26161            258             34   \n",
       "\n",
       "   new_car_count  \n",
       "0              1  \n",
       "1             11  \n",
       "2             40  \n",
       "3              1  \n",
       "4             13  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# display(train[['new_ind','new_reg','new_car']].head(5))\n",
    "display(train[cat_count_features].head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 Get the feature from feature training ## "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\",'rb'), encoding='iso-8859-1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4 Statistic features ##\n",
    "\n",
    "- find the feature of median, mean and standard deviation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#feature aggregation\n",
    "target_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\n",
    "group_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']\n",
    "\n",
    "#return numpy because we need to do np.hstack to merge all statistic feature together, so that it would return np array\n",
    "train_statis, test_statis =  Statistic_features(train, test, target_features, group_features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  1.57043000e+05,   8.32786207e-01,   2.41530046e-01, ...,\n",
       "          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\n",
       "       [  1.30452000e+05,   8.26528390e-01,   2.35133348e-01, ...,\n",
       "          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\n",
       "       [  6.35510000e+04,   8.13168936e-01,   2.35946815e-01, ...,\n",
       "          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\n",
       "       ..., \n",
       "       [  3.58630000e+04,   8.00633360e-01,   2.34463222e-01, ...,\n",
       "          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\n",
       "       [  2.04836000e+05,   8.24270444e-01,   2.28649975e-01, ...,\n",
       "          1.00000000e+00,   7.00000000e+00,   0.00000000e+00],\n",
       "       [  9.85280000e+04,   8.24453229e-01,   2.37806003e-01, ...,\n",
       "          1.00000000e+00,   7.00000000e+00,   0.00000000e+00]])"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(train_statis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 Combine all feature together, and get ready for training ##\n",
    "\n",
    "1. merge features into train_list & test_list that would like to dump into NN model\n",
    "2. training a scaler by sparse that generated by train_list & test_list\n",
    "4. convert train_list & test_list into X , X_test, which has been scaled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Building a train list including train_num, cat_count_features, statistic feature and infered feature training by XGboost.\n",
    "\n",
    "train_num: training data without set of cat_calc\n",
    "cat_count_features: cat_fea + ['new_ind','new_reg','new_car']\n",
    "train_fea0: feature extraction \n",
    "'''\n",
    "\n",
    "\n",
    "#training data without set of cat_calc\n",
    "train_num = train[[x for x in list(train) if x in num_features]]\n",
    "test_num = test[[x for x in list(train) if x in num_features]]\n",
    "\n",
    "train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_statis, train_fea0 ]\n",
    "test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_statis,test_fea0] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "'''\n",
    "X are stacked from 5 features\n",
    "1. train_num(595212,54): training data without set of cat_calc\n",
    "2. cat_count_features(595212,17): cat_fea + ['new_ind','new_reg','new_car']\n",
    "3. feature statis(595212,6) * 36\n",
    "4. train_fea0(595212, 38): feature extraction\n",
    "\n",
    "all_data (595212, 235)\n",
    "'''\n",
    "\n",
    "\n",
    "X = sparse.hstack(train_list).tocsr()\n",
    "X_test = sparse.hstack(test_list).tocsr()\n",
    "\n",
    "all_data = np.vstack([X.toarray(), X_test.toarray()])\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(all_data)\n",
    "X = scaler.transform(X.toarray())\n",
    "X_test = scaler.transform(X_test.toarray())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.6 Create Cat_feature for NN embeding training ##\n",
    " Don't ask why they doing this! they only tell you what is this\n",
    " \n",
    " - in the feature NN model of the finial testing data, you would need the list, which likes **[[cat_featrue], X]**\n",
    " or **[['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'], X]**\n",
    " \n",
    " **_This is to process the above testing data. If you could not understand, that is fine, and just look the next steps_**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/pandas/core/indexing.py:630: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      "  self.obj[item_labels[indexer[info_axis]]] = value\n"
     ]
    }
   ],
   "source": [
    "#preparing for training cat \n",
    "train_cat = train[cat_fea]\n",
    "test_cat = test[cat_fea]\n",
    "\n",
    "# convert pd to np.array\n",
    "X_cat = train_cat.values\n",
    "tem = test_cat.values\n",
    "\n",
    "# storing the dimension for embedding layer as an input value\n",
    "max_cat_values = []\n",
    "\n",
    "for c in cat_fea:\n",
    "    \n",
    "    #nomalize the label\n",
    "    #LabelEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html\n",
    "    \n",
    "    le = LabelEncoder()\n",
    "    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\n",
    "    train_cat.loc[:,c] = le.transform(train_cat[c])\n",
    "    test_cat.loc[:,c] = le.transform(test_cat[c])\n",
    "    max_cat_values.append(np.max(x))\n",
    "\n",
    "# Build the final testing data\n",
    "X_TEST_CAT = []\n",
    "for i in range(tem.shape[1]):\n",
    "    X_TEST_CAT.append(tem[:, i].reshape(-1, 1))\n",
    "X_TEST_CAT.append(X_test)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cat_fea: ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat']\n",
      "\n",
      "max_cat_values:  [4, 2, 7, 12, 2, 2, 9, 2, 17, 2, 1, 5, 2, 103]\n"
     ]
    }
   ],
   "source": [
    "print('cat_fea:', cat_fea)\n",
    "print('\\nmax_cat_values: ',max_cat_values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Training NN Model with Keras # "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Build the model\n",
    "2. training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Build the model ##"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### model structure:  ###\n",
    "<img src=\"Jupyter_image/NN_layer.png\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "def nn_model():\n",
    "    inputs = []\n",
    "    flatten_layers = []\n",
    "    for e, c in enumerate(cat_fea):\n",
    "        input_c = Input(shape=(1, ), dtype='int32')\n",
    "        num_c = max_cat_values[e]\n",
    "        \n",
    "        # need to add 1, https://keras.io/layers/embeddings/\n",
    "        # **input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.**\n",
    "        embed_c = Embedding(num_c+1,6,input_length=1)(input_c)\n",
    "        embed_c = Dropout(0.25)(embed_c)\n",
    "        flatten_c = Flatten()(embed_c)\n",
    "        inputs.append(input_c)\n",
    "        flatten_layers.append(flatten_c)\n",
    "        \n",
    "    \n",
    "    input_num = Input(shape=(X.shape[1],), dtype='float32')\n",
    "    inputs.append(input_num)\n",
    "    \n",
    "    #merge X and embedding layer\n",
    "    flatten_layers.append(input_num)\n",
    "    flatten = merge(flatten_layers, mode='concat')\n",
    "\n",
    "    fc1 = Dense(512, kernel_initializer='he_normal')(flatten)\n",
    "    fc1 = PReLU()(fc1)\n",
    "    fc1 = BatchNormalization()(fc1)\n",
    "    fc1 = Dropout(0.75)(fc1)\n",
    "\n",
    "    fc1 = Dense(64, kernel_initializer='he_normal')(fc1)\n",
    "    fc1 = PReLU()(fc1)\n",
    "    fc1 = BatchNormalization()(fc1)\n",
    "    fc1 = Dropout(0.5)(fc1)\n",
    "\n",
    "    outputs = Dense(1, kernel_initializer='he_normal', activation='sigmoid')(fc1)\n",
    "\n",
    "    model = Model(inputs = inputs, outputs = outputs)\n",
    "    model.compile(loss='binary_crossentropy', optimizer='adam')\n",
    "    return (model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Start to Train ##"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/ipykernel_launcher.py:22: UserWarning: The `merge` function is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\n",
      "/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/keras/legacy/layers.py:465: UserWarning: The `Merge` layer is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\n",
      "  name=name)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 297606 samples, validate on 297606 samples\n",
      "Epoch 1/1\n",
      " - 27s - loss: 0.3061 - val_loss: 0.1639\n",
      "local fold Gini:  0.209322322663\n",
      "Train on 297606 samples, validate on 297606 samples\n",
      "Epoch 1/1\n",
      " - 28s - loss: 0.3087 - val_loss: 0.1645\n",
      "local fold Gini:  0.201585256464\n",
      "seed 0: Gini 0.20545379936910999\n",
      "Total training time:  0:01:55.265348\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "#validation fold\n",
    "NFOLDS = 5\n",
    "kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n",
    "\n",
    "I change \"test\" to \"vaild\" because I feel it is clear to understand\n",
    "\"\"\"\n",
    "\n",
    "cv_train = np.zeros(len(train_label))\n",
    "cv_pred = np.zeros(len(test_id))\n",
    "\n",
    "#validation fold\n",
    "NFOLDS = 5\n",
    "kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n",
    "\n",
    "#with different random see make result stable.\n",
    "num_seeds = 5\n",
    "begintime = time()\n",
    "if cv_only:\n",
    "    for s in range(num_seeds):\n",
    "        np.random.seed(s)\n",
    "        for (train_index, valid_index) in kfold.split(X, train_label):\n",
    "            \n",
    "            #assign data from training data and labels to validation data; \n",
    "            x_train = X[train_index]\n",
    "            y_train = train_label[train_index]\n",
    "            x_valid= X[valid_index]\n",
    "            y_valid = train_label[valid_index]\n",
    "            \n",
    "            # assign X_cat to validation data; \n",
    "            x_train_cat = X_cat[train_index]\n",
    "            x_valid_cat = X_cat[valid_index]\n",
    "\n",
    "            #Package data for training, the package(list) is  [[cat_featrues], x_train] \n",
    "            # or [ ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'] ,x_train]\n",
    "            \n",
    "            x_train_cat_list, x_valid_cat_list = [], []\n",
    "            for i in range(x_train_cat.shape[1]):\n",
    "                x_train_cat_list.append(x_train_cat[:, i].reshape(-1, 1))\n",
    "                x_valid_cat_list.append(x_valid_cat[:, i].reshape(-1, 1))\n",
    "\n",
    "            x_train_cat_list.append(x_train)\n",
    "            x_valid_cat_list.append(x_valid)\n",
    "            \n",
    "            #load model\n",
    "            model = nn_model()\n",
    "            \n",
    "            def get_rank(x):\n",
    "                return pd.Series(x).rank(pct=True).values\n",
    "            #fit model. Note: Change epochs to make prediction accuracy\n",
    "            model.fit(x_train_cat_list, y_train, epochs=10, batch_size=512, verbose=2, validation_data=[x_valid_cat_list, y_valid])\n",
    "            \n",
    "            #record prediction with validation data\n",
    "            cv_train[valid_index] += get_rank(model.predict(x=x_valid_cat_list, batch_size=512, verbose=0)[:, 0])\n",
    "            print('local fold Gini: ',Gini(train_label[valid_index], cv_train[valid_index]))\n",
    "            \n",
    "            #recode prediction with testing data\n",
    "            cv_pred += get_rank(model.predict(x=X_TEST_CAT, batch_size=512, verbose=0)[:, 0])\n",
    "             \n",
    "            \n",
    "        \n",
    "        print(\"seed {0}: Gini {1}\".format(s,Gini(train_label, cv_train / (1. * (s + 1)))))\n",
    "        print(\"Total training time: \",str(datetime.timedelta(seconds=time() - begintime)))\n",
    "    if save_cv:\n",
    "        \n",
    "        #divid (NFOLDS * num_seeds) to get average of probablity \n",
    "        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)\n",
    "        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: Jupyter_nnmodel/util.py
================================================
import numpy as np
import pandas as pd

def Gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]

    # sort rows on prediction column
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:, 0].argsort()][::-1, 0]
    pred_order = arr[arr[:, 1].argsort()][::-1, 0]

    # get Lorenz curves
    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
    L_ones = np.linspace(1. / n_samples, 1, n_samples)  # 1. forces float division under Python 2 as well

    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)

    # normalize to true Gini coefficient
    return G_pred * 1. / G_true
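

# Illustrative sketch (not part of the original solution): for binary labels and
# predictions without ties, the normalized Gini above is expected to equal
# 2 * AUC - 1. `gini_from_auc` is a hypothetical helper name; the AUC is computed
# from ranks (Mann-Whitney U) with NumPy only. NumPy is re-imported here only so
# that the sketch is self-contained.
import numpy as np


def gini_from_auc(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    # ascending ranks 1..n of the predictions (assumes no tied predictions)
    ranks = np.empty(len(y_pred))
    ranks[np.argsort(y_pred)] = np.arange(1, len(y_pred) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # AUC = fraction of (positive, negative) pairs ranked in the right order
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.) / (n_pos * n_neg)
    return 2. * auc - 1.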


def cat_count(train_df, test_df, cat_list):
    """For every column in cat_list, count how often each row's category occurs in train + test."""
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])
    for e, cat in enumerate(cat_list):
        grouped = all_df[[cat]].groupby(cat)
        the_size = pd.DataFrame(grouped.size()).reset_index()
        the_size.columns = [cat, '{}_size'.format(cat)]
        all_df = pd.merge(all_df, the_size, how='left')

    # split once, after all count columns have been merged
    # (.copy() avoids SettingWithCopyWarning on the in-place operations below)
    selected_train = all_df[all_df['train'] == 1].copy()
    selected_test = all_df[all_df['train'] == 0].copy()
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
    selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test


def proj_num_on_cat(train_df, test_df, target_column, group_column):
    """
    :param train_df: train data frame
    :param test_df:  test data frame
    :param target_column: name of numerical feature
    :param group_column: name of categorical feature
    """
    train_df['row_id'] = range(train_df.shape[0])  # row index, used to restore the original order after the merge
    
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    
    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',
                                                                                        target_column, group_column]]).copy()
    
    
    # group the target column by the values of the group column
    grouped = all_df[[target_column, group_column]].groupby(group_column)

    # number of rows for each distinct group value,
    # e.g. [1, 1, 2, 3] has 3 distinct values with sizes 1: 2, 2: 1, 3: 1
    the_size = pd.DataFrame(grouped.size()).reset_index()
    the_size.columns = [group_column, '%s_size' % target_column]  #rename columns name

    #find the mean, std, median, max, min of the target within each distinct group value
    the_mean = pd.DataFrame(grouped.mean()).reset_index()
    the_mean.columns = [group_column, '%s_mean' % target_column] #rename columns name
    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
    the_std.columns = [group_column, '%s_std' % target_column]
    the_median = pd.DataFrame(grouped.median()).reset_index()
    the_median.columns = [group_column, '%s_median' % target_column]
    the_max = pd.DataFrame(grouped.max()).reset_index()
    the_max.columns = [group_column, '%s_max' % target_column]
    the_min = pd.DataFrame(grouped.min()).reset_index()
    the_min.columns = [group_column, '%s_min' % target_column]
    
    #merge them (all frames share the index created by reset_index, so join_axes is not needed)
    the_stats = pd.concat([the_size, the_mean.iloc[:, 1], the_std.iloc[:, 1],
                           the_median.iloc[:, 1], the_max.iloc[:, 1], the_min.iloc[:, 1]],
                          axis=1)

    #insert value to the original data
    all_df = pd.merge(all_df, the_stats, how='left')

    #split into train and test
    selected_train = all_df[all_df['train'] == 1].copy()
    selected_test = all_df[all_df['train'] == 0].copy()
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    
    #remove target_column, group_column, 'row_id', 'train' columns
    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    return selected_train, selected_test
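

# Illustrative sketch (not part of the original solution): the per-group statistics
# built step by step in proj_num_on_cat can also be written as a single
# groupby().agg() call. `group_stats_sketch` is a hypothetical helper name, and the
# column naming simply mirrors the '%s_size' / '%s_mean' pattern used above.
# pandas is re-imported here only so that the sketch is self-contained.
import pandas as pd


def group_stats_sketch(df, target_column, group_column):
    stats = df.groupby(group_column)[target_column].agg(
        ['size', 'mean', 'std', 'median', 'max', 'min'])
    # a group with a single row has an undefined std; use 0, as the code above does
    stats['std'] = stats['std'].fillna(0)
    stats.columns = ['%s_%s' % (target_column, s) for s in stats.columns]
    # left merge, so every input row receives the statistics of its own group
    return pd.merge(df, stats.reset_index(), on=group_column, how='left')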


def interaction_features(train, test, fea1, fea2, prefix):
    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]

    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]
    feature_name = ['inter_{}*'.format(prefix), 'inter_{}/'.format(prefix), ]

    return train, test, feature_name


================================================
FILE: code/fea_eng0.py
================================================
"""xgb prediction as features"""
import xgboost as xgb
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

eta = 0.1
max_depth = 6
subsample = 0.9
colsample_bytree = 0.85
min_child_weight = 55
num_boost_round = 500

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']


params = {"objective": "reg:linear",
          "booster": "gbtree",
          "eta": eta,
          "max_depth": int(max_depth),
          "subsample": subsample,
          "colsample_bytree": colsample_bytree,
          "min_child_weight": min_child_weight,
          "silent": 1
          }

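data = train.append(test)
# reset_index without drop=True keeps the old index as a column named 'index';
# because 'ind' is a substring of 'index', that column is treated as an 'ind' target below
data.reset_index(inplace=True)
train_rows = train.shape[0]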

feature_results = []

for target_g in ['car', 'ind', 'reg']:
    features = [x for x in list(data) if target_g not in x]
    target_list = [x for x in list(data) if target_g in x]
    train_fea = np.array(data[features])
    for target in target_list:
        print(target)
        train_label = data[target]
        kfold = KFold(n_splits=5, random_state=218, shuffle=True)
        kf = kfold.split(data)
        cv_train = np.zeros(shape=(data.shape[0], 1))
        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,
                            early_stopping_rounds=10)
            cv_train[validate, 0] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
        feature_results.append(cv_train)

feature_results = np.hstack(feature_results)
train_features = feature_results[:train_rows, :]
test_features = feature_results[train_rows:, :]

import pickle
pickle.dump([train_features, test_features], open("../input/fea0.pk", 'wb'))
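

# Illustrative sketch (not part of the original solution): the out-of-fold scheme
# used above, reduced to NumPy. Every row is predicted by a model that was fitted
# without that row. `oof_predict` and `fit_predict` are hypothetical names, and the
# fold split is a plain shuffled split, not sklearn's KFold.
import numpy as np


def oof_predict(fit_predict, X, y, n_splits=5, seed=218):
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    oof = np.zeros(len(y))
    for k in range(n_splits):
        valid_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # fit on the other folds, predict only the held-out rows
        oof[valid_idx] = fit_predict(X[train_idx], y[train_idx], X[valid_idx])
    return oof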


================================================
FILE: code/gbm_model291.py
================================================
import lightgbm as lgbm
from scipy import sparse as ssp
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

def Gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]

    # sort rows on prediction column
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:, 0].argsort()][::-1, 0]
    pred_order = arr[arr[:, 1].argsort()][::-1, 0]

    # get Lorenz curves
    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
    L_ones = np.linspace(1. / n_samples, 1, n_samples)  # 1. forces float division under Python 2 as well

    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)

    # normalize to true Gini coefficient
    return G_pred * 1. / G_true

cv_only = True
save_cv = True
full_train = False

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True

path = "../input/"

train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

y = train['target'].values
drop_feature = [
    'id',
    'target'
]

X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')

for c in cat_features:
    le = LabelEncoder()
    le.fit(train[c])
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])

enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])

ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
    if count==0:
        train['new_ind'] = train[c].astype(str)+'_'
        test['new_ind'] = test[c].astype(str)+'_'
        count+=1
    else:
        train['new_ind'] += train[c].astype(str)+'_'
        test['new_ind'] += test[c].astype(str)+'_'

cat_count_features = []
for c in cat_features+['new_ind']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
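

# Illustrative sketch (not part of the original solution): the count (frequency)
# encoding done in the loop above, written as a small function on toy inputs.
# `count_encode` is a hypothetical helper name; counts are taken over train and
# test together, as in the loop. pandas is re-imported only so that the sketch is
# self-contained.
import pandas as pd


def count_encode(train_col, test_col):
    counts = pd.concat([train_col, test_col]).value_counts()
    # map every value to its overall frequency; values never seen get 0
    return (train_col.map(counts).fillna(0).astype(int),
            test_col.map(counts).fillna(0).astype(int))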

train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]

X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()

learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
feature_fraction = 0.6
num_boost_round = 10000
params = {"objective": "binary",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": num_leaves,
           "max_bin": 256,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }

x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in range(16):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))

    params['seed'] = s

    if cv_only:
        kf = kfold.split(X, train_label)

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            cv_train[validate] += bst.predict(X_validate)

            score = Gini(label_validate, cv_train[validate])
            print(score)
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print(Gini(train_label, cv_train))
        print("current score: {} {}".format(Gini(train_label, final_cv_train / (s + 1.)), s+1))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)
pd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm3_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm3_cv_avg.csv', index=False)



================================================
FILE: code/nn_model290.py
================================================
from keras.layers import Dense, Dropout, Embedding, Flatten, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from time import time
import datetime
from keras.models import Model
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini, interaction_features
from itertools import combinations
from util import proj_num_on_cat
from scipy import sparse
from sklearn.preprocessing import StandardScaler
import pickle
from sklearn.preprocessing import LabelEncoder

cv_only = True
save_cv = True

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]

feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        train['new_ind'] = train[c].astype(str)
        count += 1
    else:
        train['new_ind'] += '_' + train[c].astype(str)

ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        test['new_ind'] = test[c].astype(str)
        count += 1
    else:
        test['new_ind'] += '_' + test[c].astype(str)

reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        train['new_reg'] = train[c].astype(str)
        count += 1
    else:
        train['new_reg'] += '_' + train[c].astype(str)

reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        test['new_reg'] = test[c].astype(str)
        count += 1
    else:
        test['new_reg'] += '_' + test[c].astype(str)

car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
    if count == 0:
        train['new_car'] = train[c].astype(str)
        count += 1
    else:
        train['new_car'] += '_' + train[c].astype(str)

car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
    if count == 0:
        test['new_car'] = test[c].astype(str)
        count += 1
    else:
        test['new_car'] += '_' + test[c].astype(str)

train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]

max_cat_values = []
for c in cat_fea:
    le = LabelEncoder()
    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
    train_cat[c] = le.transform(train_cat[c])
    test_cat[c] = le.transform(test_cat[c])
    max_cat_values.append(np.max(x))

# xgboost prediction
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk", "rb"))

cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)


print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0]

#feature aggregation
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()

all_data = np.vstack([X.toarray(), X_test.toarray()])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
print(X.shape, X_test.shape)


cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))

X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()

x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)

def nn_model():
    inputs = []
    flatten_layers = []
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        # Embedding input_dim is the vocabulary size, i.e. largest label index + 1
        num_c = max_cat_values[e] + 1
        embed_c = Embedding(
            num_c,
            6,
            input_length=1
        )(input_c)
        embed_c = Dropout(0.25)(embed_c)
        flatten_c = Flatten()(embed_c)

        inputs.append(input_c)
        flatten_layers.append(flatten_c)

    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)

    flatten = merge(flatten_layers, mode='concat')

    fc1 = Dense(512, init='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.75)(fc1)

    fc1 = Dense(64, init='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.5)(fc1)

    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)

    model = Model(input = inputs, output = outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)

num_seeds = 5
begintime = time()
if cv_only:
    for s in xrange(num_seeds):
        np.random.seed(s)
        for (inTr, inTe) in kfold.split(X, train_label):
            xtr = X[inTr]
            ytr = train_label[inTr]
            xte = X[inTe]
            yte = train_label[inTe]

            xtr_cat = X_cat[inTr]
            xte_cat = X_cat[inTe]

            # get xtr xte cat
            xtr_cat_list, xte_cat_list = [], []
            for i in xrange(xtr_cat.shape[1]):
                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))

            xtr_cat_list.append(xtr)
            xte_cat_list.append(xte)

            model = nn_model()
            def get_rank(x):
                return pd.Series(x).rank(pct=True).values
            model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
            print(Gini(train_label[inTe], cv_train[inTe]))
            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
        print(s)
        print(Gini(train_label, cv_train / (1. * (s + 1))))
        print(str(datetime.timedelta(seconds=time() - begintime)))
    if save_cv:
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)



================================================
FILE: code/simple_average.py
================================================
'''
simple average of two models to get 2nd place
'''
import pandas as pd
keras5_test = pd.read_csv("../model/keras5_pred.csv")
lgbm3_test = pd.read_csv("../model/lgbm3_pred_avg.csv")

def get_rank(x):
    return pd.Series(x).rank(pct=True).values

# average the percentile ranks of the Keras and LightGBM predictions with equal weight
pd.DataFrame({'id': keras5_test['id'], 'target':
    get_rank(keras5_test['target']) * 0.5 + get_rank(lgbm3_test['target']) * 0.5}).to_csv(
    "../model/simple_average.csv", index = False)

================================================
FILE: code/util.py
================================================
import numpy as np
import pandas as pd

def Gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]

    # sort rows on prediction column
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:, 0].argsort()][::-1, 0]
    pred_order = arr[arr[:, 1].argsort()][::-1, 0]

    # get Lorenz curves
    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
    # use float division so the diagonal starts at 1/n under Python 2 as well
    L_ones = np.linspace(1. / n_samples, 1, n_samples)

    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)

    # normalize to true Gini coefficient
    return G_pred * 1. / G_true


def cat_count(train_df, test_df, cat_list):
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])
    for e, cat in enumerate(cat_list):
        grouped = all_df[[cat]].groupby(cat)
        the_size = pd.DataFrame(grouped.size()).reset_index()
        the_size.columns = [cat, '{}_size'.format(cat)]
        all_df = pd.merge(all_df, the_size, how='left')


    # split back into train/test once, after all size columns have been merged
    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
    selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test


def proj_num_on_cat(train_df, test_df, target_column, group_column):
    """
    :param train_df: train data frame
    :param test_df:  test data frame
    :param target_column: name of numerical feature
    :param group_column: name of categorical feature
    """
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',
                                                                                        target_column, group_column]])
    grouped = all_df[[target_column, group_column]].groupby(group_column)
    the_size = pd.DataFrame(grouped.size()).reset_index()
    the_size.columns = [group_column, '%s_size' % target_column]
    the_mean = pd.DataFrame(grouped.mean()).reset_index()
    the_mean.columns = [group_column, '%s_mean' % target_column]
    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
    the_std.columns = [group_column, '%s_std' % target_column]
    the_median = pd.DataFrame(grouped.median()).reset_index()
    the_median.columns = [group_column, '%s_median' % target_column]
    the_stats = pd.merge(the_size, the_mean)
    the_stats = pd.merge(the_stats, the_std)
    the_stats = pd.merge(the_stats, the_median)

    the_max = pd.DataFrame(grouped.max()).reset_index()
    the_max.columns = [group_column, '%s_max' % target_column]
    the_min = pd.DataFrame(grouped.min()).reset_index()
    the_min.columns = [group_column, '%s_min' % target_column]

    the_stats = pd.merge(the_stats, the_max)
    the_stats = pd.merge(the_stats, the_min)

    all_df = pd.merge(all_df, the_stats, how='left')

    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test


def interaction_features(train, test, fea1, fea2, prefix):
    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]

    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]

    return train, test


================================================
FILE: code_for_exact_solution/keras3.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle

'''
Keras feed-forward network on one-hot encoded categorical features plus
scaled numeric, count and aggregation features
'''
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from sklearn.linear_model import LogisticRegression as LR
from scipy.sparse import csr_matrix
cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)

train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk", "rb"))

cat_count_features = []
for c in cat_fea:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)

train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]

#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])

scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)

print(X.shape, X_test.shape)


cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))

def nn_model():
    model = Sequential()

    model.add(Dense(512, input_dim=X.shape[1], init='he_normal'))
    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.9))

    model.add(Dense(64, init='he_normal'))
    model.add(PReLU())
    model.add(BatchNormalization())
    model.add(Dropout(0.8))

    model.add(Dense(1, init='he_normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)

num_seeds = 5
begintime = time()
if cv_only:
    for s in xrange(num_seeds):
        np.random.seed(s)
        for (inTr, inTe) in kfold.split(X, train_label):
            xtr = X[inTr]
            ytr = train_label[inTr]
            xte = X[inTe]
            yte = train_label[inTe]

            model = nn_model()
            model.fit(xtr, ytr, epochs=35, batch_size=512, verbose=2, validation_data=[xte, yte])
            cv_train[inTe] += model.predict_proba(x=xte, batch_size=512, verbose=0)[:, 0]
            cv_pred += model.predict_proba(x=X_test, batch_size=512, verbose=0)[:, 0]
        print(s)
        print(Gini(train_label, cv_train / (1. * (s + 1))))
        print(str(datetime.timedelta(seconds=time() - begintime)))
    if save_cv:
        pd.DataFrame({'id': test_id, 'target': cv_pred * 1./ (NFOLDS * num_seeds)}).to_csv('../model/keras3_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': cv_train * 1. / num_seeds}).to_csv('../model/keras3_cv.csv', index=False)



================================================
FILE: code_for_exact_solution/keras6.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
from keras.models import Model
'''
Keras network with embedding layers for the categorical features plus
scaled numeric, count and aggregation features
'''
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from sklearn.linear_model import LogisticRegression as LR
from scipy.sparse import csr_matrix
cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)

path = "../input/"
num_features_comb = []
for p in os.listdir(path):
    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
        print(p)
        x,xt = pd.read_pickle(path+p)
        train[p] = x
        test[p] = xt
        num_features_comb.append(p)

num_features += num_features_comb

feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        train['new_ind'] = train[c].astype(str)
        count += 1
    else:
        train['new_ind'] += '_' + train[c].astype(str)

print(train['new_ind'].nunique())
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        test['new_ind'] = test[c].astype(str)
        count += 1
    else:
        test['new_ind'] += '_' + test[c].astype(str)

reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        train['new_reg'] = train[c].astype(str)
        count += 1
    else:
        train['new_reg'] += '_' + train[c].astype(str)

print(train['new_reg'].nunique())
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        test['new_reg'] = test[c].astype(str)
        count += 1
    else:
        test['new_reg'] += '_' + test[c].astype(str)

car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
    if count == 0:
        train['new_car'] = train[c].astype(str)
        count += 1
    else:
        train['new_car'] += '_' + train[c].astype(str)

print(train['new_car'].nunique())
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
    if count == 0:
        test['new_car'] = test[c].astype(str)
        count += 1
    else:
        test['new_car'] += '_' + test[c].astype(str)

new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
from sklearn.preprocessing import LabelEncoder

train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]

max_cat_values = []
for c in cat_fea:
    le = LabelEncoder()
    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
    train_cat[c] = le.transform(train_cat[c])
    test_cat[c] = le.transform(test_cat[c])
    max_cat_values.append(np.max(x))


#train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))

cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)


print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]

#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])

scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)

print(X.shape, X_test.shape)


cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))

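# NOTE: this definition is shadowed by the second nn_model() defined further down;
# only the later definition is used for training.
def nn_model():
    inputs = []
    flatten_layers = []
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        num_c = max_cat_values[e]
        embed_c = Embedding(
            num_c,
            64,
            input_length=1
        )(input_c)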
        embed_c = Dropout(0.25)(embed_c)
        flatten_c = Flatten()(embed_c)

        inputs.append(input_c)
        flatten_layers.append(flatten_c)

    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)

    flatten = merge(flatten_layers, mode='concat')

    fc1 = Dense(512, init='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)

    fc1 = Dense(64, init='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)

    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)

    model = Model(input = inputs, output = outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)


X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()

x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
num_seeds = 5

def nn_model():
    inputs = []
    flatten_layers = []
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        # Embedding input_dim is the vocabulary size, i.e. largest label index + 1
        num_c = max_cat_values[e] + 1
        embed_c = Embedding(
            num_c,
            6,
            input_length=1
        )(input_c)
        embed_c = Dropout(0.25)(embed_c)
        flatten_c = Flatten()(embed_c)

        inputs.append(input_c)
        flatten_layers.append(flatten_c)

    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)

    flatten = merge(flatten_layers, mode='concat')

    fc1 = Dense(512, init='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.75)(fc1)

    fc1 = Dense(64, init='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.5)(fc1)

    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)

    model = Model(input = inputs, output = outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)

num_seeds = 5
begintime = time()
if cv_only:
    for s in xrange(num_seeds):
        np.random.seed(s)
        for (inTr, inTe) in kfold.split(X, train_label):
            xtr = X[inTr]
            ytr = train_label[inTr]
            xte = X[inTe]
            yte = train_label[inTe]

            xtr_cat = X_cat[inTr]
            xte_cat = X_cat[inTe]

            # get xtr xte cat
            xtr_cat_list, xte_cat_list = [], []
            for i in xrange(xtr_cat.shape[1]):
                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))

            xtr_cat_list.append(xtr)
            xte_cat_list.append(xte)

            model = nn_model()
            def get_rank(x):
                return pd.Series(x).rank(pct=True).values
            model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
            print(Gini(train_label[inTe], cv_train[inTe]))
            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
        print(s)
        print(Gini(train_label, cv_train / (1. * (s + 1))))
        print(str(datetime.timedelta(seconds=time() - begintime)))
    if save_cv:
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras6_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras6_cv.csv', index=False)



================================================
FILE: code_for_exact_solution/keras7.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
from keras.models import Model
'''
Keras neural network model
'''
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
from sklearn.linear_model import LogisticRegression as LR
from scipy.sparse import csr_matrix
cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
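
# Illustrative sketch, not part of the original pipeline: the two lines above
# count, per row, how many columns hold the dataset's missing-value sentinel -1.
# `count_missing_sketch` is a hypothetical helper name and is not called anywhere.
def count_missing_sketch(df, sentinel=-1):
    """Return the number of sentinel-valued cells in each row as floats."""
    return (df == sentinel).sum(axis=1).astype(float)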

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)

path = "../input/"
num_features_comb = []
for p in os.listdir(path):
    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
        print(p)
        x,xt = pd.read_pickle(path+p)
        train[p] = x
        test[p] = xt
        num_features_comb.append(p)

num_features += num_features_comb

feature_names = list(train)


def concat_as_string(df, columns):
    """Join the given columns row-wise into one '_'-separated string key."""
    key = df[columns[0]].astype(str)
    for c in columns[1:]:
        key = key + '_' + df[c].astype(str)
    return key


# Build one combined categorical key per feature group (ind, reg, car).
for group in ['ind', 'reg', 'car']:
    group_features = [c for c in feature_names if group in c]
    train['new_' + group] = concat_as_string(train, group_features)
    test['new_' + group] = concat_as_string(test, group_features)
    print(train['new_' + group].nunique())

new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
# Copy the slices so the in-place label encoding below does not write into views of train/test.
train_cat = train[cat_fea].copy()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].copy()
test_num = test[[x for x in list(train) if x in num_features]]

# Label-encode each categorical column on train+test so both share one mapping.
max_cat_values = []
for c in cat_fea:
    le = LabelEncoder()
    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
    train_cat[c] = le.transform(train_cat[c])
    test_cat[c] = le.transform(test_cat[c])
    max_cat_values.append(np.max(x))


# Open the pickle in binary mode and close the handle once it is loaded.
with open("../input/fea0.pk", "rb") as f:
    train_fea0, test_fea0 = pickle.load(f)

cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
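
# Illustrative sketch, not part of the original pipeline: the loop above is a
# count (frequency) encoding, where each category value is replaced by how often
# it occurs in train and test combined. `count_encode_sketch` is a hypothetical
# helper name and is not called anywhere in this script.
import pandas as pd


def count_encode_sketch(train_col, test_col):
    """Return (train_counts, test_counts) for one categorical column."""
    counts = pd.concat([train_col, test_col]).value_counts()
    return train_col.map(counts), test_col.map(counts)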


print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]

#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])

scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)

print(X.shape, X_test.shape)


cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))

X_cat = train_cat.values
X_test_cat = test_cat.values

# Keras takes one input array per categorical column, followed by the numeric block.
x_test_cat = []
for i in range(X_test_cat.shape[1]):
    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)

def nn_model():
    inputs = []
    flatten_layers = []
    # One 3-dimensional embedding per categorical feature.
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        # Encoded labels run from 0 to max_cat_values[e], so the vocabulary size is max + 1.
        num_c = max_cat_values[e] + 1
        embed_c = Embedding(
            num_c,
            3,
            input_length=1
        )(input_c)
        embed_c = Dropout(0.)(embed_c)
        flatten_c = Flatten()(embed_c)

        inputs.append(input_c)
        flatten_layers.append(flatten_c)

    # The scaled numeric features enter as a single dense input.
    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)

    flatten = concatenate(flatten_layers)

    fc1 = Dense(512, kernel_initializer='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)

    fc1 = Dense(64, kernel_initializer='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)

    outputs = Dense(1, kernel_initializer='he_normal', activation='sigmoid')(fc1)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model

def get_rank(x):
    # Percentile ranks in (0, 1], so predictions from different folds and seeds share one scale.
    return pd.Series(x).rank(pct=True).values

num_seeds = 3
begintime = time()
if cv_only:
    for s in range(num_seeds):
        np.random.seed(s)
        for (inTr, inTe) in kfold.split(X, train_label):
            xtr = X[inTr]
            ytr = train_label[inTr]
            xte = X[inTe]
            yte = train_label[inTe]

            xtr_cat = X_cat[inTr]
            xte_cat = X_cat[inTe]

            # Split the categorical matrix into one column array per embedding input.
            xtr_cat_list, xte_cat_list = [], []
            for i in range(xtr_cat.shape[1]):
                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))

            xtr_cat_list.append(xtr)
            xte_cat_list.append(xte)

            model = nn_model()
            model.fit(xtr_cat_list, ytr, epochs=30, batch_size=512, verbose=2, validation_data=(xte_cat_list, yte))
            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
            print(Gini(train_label[inTe], cv_train[inTe]))
            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
        print(s)
        print(Gini(train_label, cv_train / (1. * (s + 1))))
        print(str(datetime.timedelta(seconds=time() - begintime)))
        # Checkpoint after every seed; the final rank transform makes the divisor irrelevant.
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred_0.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv_0.csv', index=False)

    if save_cv:
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv.csv', index=False)
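
# Illustrative sketch, not part of the original pipeline: the script above blends
# fold and seed predictions on a percentile-rank scale, so that models whose raw
# outputs are scaled differently contribute equally to the average.
# `rank_average_sketch` is a hypothetical helper name and is not called anywhere.
import numpy as np
import pandas as pd


def rank_average_sketch(prediction_sets):
    """Average several prediction vectors after converting each to percentile ranks."""
    ranks = [pd.Series(p).rank(pct=True).values for p in prediction_sets]
    return np.mean(ranks, axis=0)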


================================================
FILE: code_for_exact_solution/lightgbm1.py
================================================
'''
simple lightgbm benchmark
'''
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

cv_only = True
save_cv = True
full_train = False

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True
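

# Illustrative sketch, not part of the original pipeline: the metric reported by
# evalerror is the normalized Gini coefficient, which for binary labels equals
# 2 * AUC - 1. `gini_from_auc_sketch` is a hypothetical name; the scripts rely on
# `Gini` from util.py, whose implementation may differ in details such as tie handling.
from sklearn.metrics import roc_auc_score


def gini_from_auc_sketch(labels, preds):
    """Normalized Gini computed from ROC AUC (binary labels only)."""
    return 2.0 * roc_auc_score(labels, preds) - 1.0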

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_list = [x for x in list(train) if 'cat' in x]
print(cat_list)

train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy

#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=False)
# DataFrame.as_matrix() is deprecated in newer pandas; .values returns the same array.
train_cat = train[[x for x in list(train) if 'cat' in x]].values
train_num = train[[x for x in list(train) if 'cat' not in x]]
test_cat = test[[x for x in list(test) if 'cat' in x]].values
test_num = test[[x for x in list(test) if 'cat' not in x]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

train_ohe = ohe.fit_transform(train_cat)
test_ohe = ohe.transform(test_cat)


print("cat_list now:", cat_list)
train_cat_count, test_cat_count = cat_count(train, test, cat_list)
print("cat count shape:", train_cat_count.shape, test_cat_count.shape)

X = sparse.hstack([train_num, train_ohe, train_cat_count]).tocsr()
X_test = sparse.hstack([test_num, test_ohe, test_cat_count]).tocsr()
print(X.shape, X_test.shape)

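params = {"objective": "binary",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
          "max_bin": 256,
          "min_data_in_leaf": min_data_in_leaf,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "seed": 218,  # overwritten per run in the seed loop below
          "drop_rate": 0.1,  # only used when boosting_type is "dart"
          "is_unbalance": False,
          "max_drop": 50,  # only used when boosting_type is "dart"
          "min_child_samples": 10,  # alias of min_data_in_leaf, which is also set above
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9  # alias of bagging_fraction; takes effect only with bagging_freq > 0
          }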

num_seeds = 16
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in range(num_seeds):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))

    params['seed'] = s

    if cv_only:
        kf = kfold.split(X, train_label)

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            # Score the validation fold with the same early-stopped iteration used for the test predictions.
            cv_train[validate] += bst.predict(X_validate, num_iteration=bst.best_iteration)

            score = Gini(label_validate, cv_train[validate])
            print(score)
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print(Gini(train_label, cv_train))
        print("current score: %s %s" % (Gini(train_label, final_cv_train / (s + 1.)), s + 1))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)
pd.DataFrame({'id': test_id, 'target': final_cv_pred / float(num_seeds)}).to_csv('../model/lgbm1_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / float(num_seeds)}).to_csv('../model/lgbm1_cv_avg.csv', index=False)

================================================
FILE: code_for_exact_solution/lightgbm5.py
================================================
import numpy as np
import pandas as pd
from scipy import sparse as ssp
from scipy.stats import spearmanr
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from util import Gini

cv_only = True
save_cv = True
full_train = False


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True

path = "../input/"

train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000

y = train['target'].values
drop_feature = [
    'id',
    'target'
]

X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')

# Label-encode, then one-hot encode, the categorical columns (both encoders are fit on train only).
for c in cat_features:
    le = LabelEncoder()
    le.fit(train[c])
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])

enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])

ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
    if count==0:
        train['new_ind'] = train[c].astype(str)+'_'
        test['new_ind'] = test[c].astype(str)+'_'
        count+=1
    else:
        train['new_ind'] += train[c].astype(str)+'_'
        test['new_ind'] += test[c].astype(str)+'_'

cat_count_features = []
for c in cat_features+['new_ind']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
    score = spearmanr(train['target'],train['%s_count'%c])
    print(c,score)

train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]

# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
#    train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)

X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()


params = {"objective": "poisson",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
           "max_bin": 256,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }

num_seeds = 10
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in range(num_seeds):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))

    params['seed'] = s

    if cv_only:
        kf = kfold.split(X, train_label)

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            # Score the validation fold with the same early-stopped iteration used for the test predictions.
            cv_train[validate] += bst.predict(X_validate, num_iteration=bst.best_iteration)

            score = Gini(label_validate, cv_train[validate])
            print(score)
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print(Gini(train_label, cv_train))
        print("current score: %s %s" % (Gini(train_label, final_cv_train / (s + 1.)), s + 1))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)

# Save seed-averaged predictions as percentile ranks.
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / float(num_seeds)})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm5_pred_avg.csv', index=False)

cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / float(num_seeds)})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm5_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16


================================================
FILE: code_for_exact_solution/lightgbm6.py
================================================
import numpy as np
import pandas as pd
from scipy import sparse as ssp
from scipy.stats import spearmanr
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from util import Gini

cv_only = True
save_cv = True
full_train = False


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True

path = "../input/"

train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000

y = train['target'].values
drop_feature = [
    'id',
    'target'
]

X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')

# Label-encode, then one-hot encode, the categorical columns (both encoders are fit on train only).
for c in cat_features:
    le = LabelEncoder()
    le.fit(train[c])
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])

enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])

ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
    if count==0:
        train['new_ind'] = train[c].astype(str)+'_'
        test['new_ind'] = test[c].astype(str)+'_'
        count+=1
    else:
        train['new_ind'] += train[c].astype(str)+'_'
        test['new_ind'] += test[c].astype(str)+'_'

cat_count_features = []
for c in cat_features+['new_ind']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
    score = spearmanr(train['target'],train['%s_count'%c])
    print(c,score)

train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]

# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
#    train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)

X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()


params = {"objective": "fair",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
           "max_bin": 256,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }

num_seeds = 10
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in range(num_seeds):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))

    params['seed'] = s

    if cv_only:
        kf = kfold.split(X, train_label)

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            # Score the validation fold with the same early-stopped iteration used for the test predictions.
            cv_train[validate] += bst.predict(X_validate, num_iteration=bst.best_iteration)

            score = Gini(label_validate, cv_train[validate])
            print(score)
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print(Gini(train_label, cv_train))
        print("current score: %s %s" % (Gini(train_label, final_cv_train / (s + 1.)), s + 1))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)

# Save seed-averaged predictions as percentile ranks.
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / float(num_seeds)})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm6_pred_avg.csv', index=False)

cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / float(num_seeds)})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm6_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16


================================================
FILE: code_for_exact_solution/lightgbm7.py
================================================
import numpy as np
import pandas as pd
from scipy import sparse as ssp
from scipy.stats import spearmanr
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from util import Gini

cv_only = True
save_cv = True
full_train = False


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True

path = "../input/"

train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000

y = train['target'].values
drop_feature = [
    'id',
    'target'
]

X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')

# Label-encode, then one-hot encode, the categorical columns (both encoders are fit on train only).
for c in cat_features:
    le = LabelEncoder()
    le.fit(train[c])
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])

enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])

ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
    if count==0:
        train['new_ind'] = train[c].astype(str)+'_'
        test['new_ind'] = test[c].astype(str)+'_'
        count+=1
    else:
        train['new_ind'] += train[c].astype(str)+'_'
        test['new_ind'] += test[c].astype(str)+'_'

cat_count_features = []
for c in cat_features+['new_ind']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
    score = spearmanr(train['target'],train['%s_count'%c])
    print(c,score)

train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]

# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
#    train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)

X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()


params = {"objective": "binary",
          "boosting_type": "goss",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
           "max_bin": 256,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }

x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(10):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))

    params['seed'] = s

    if cv_only:
        kf = kfold.split(X, train_label)

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            cv_train[validate] += bst.predict(X_validate)

            score = Gini(label_validate, cv_train[validate])
            print score
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print(Gini(train_label, cv_train))
        print("current score: {} {}".format(Gini(train_label, final_cv_train / (s + 1.)), s + 1))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)

pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm7_pred_avg.csv', index=False)

cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm7_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16


================================================
FILE: code_for_exact_solution/lightgbm8.py
================================================
import numpy as np
import scipy as sp
import pandas as pd
from scipy import sparse
from scipy import sparse as ssp
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
import xgboost as xgb
import category_encoders as ce
from itertools import combinations
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, normalize
from util import Gini, proj_num_on_cat, interaction_features, cat_count

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000

cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True


train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)

path = "../input/"
num_features_comb = []
import os
for p in os.listdir(path):
    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
        print(p)
        x,xt = pd.read_pickle(path+p)
        train[p] = x
        test[p] = xt
        num_features_comb.append(p)

num_features += num_features_comb

feature_names = list(train)

# Build one string key per feature group ('ind', 'reg', 'car') by joining the
# values of every column in that group; the keys are count-encoded further below.
for group in ['ind', 'reg', 'car']:
    group_features = [c for c in feature_names if group in c]
    new_col = 'new_%s' % group
    for df in (train, test):
        df[new_col] = df[group_features[0]].astype(str)
        for c in group_features[1:]:
            df[new_col] += '_' + df[c].astype(str)
    print(train[new_col].nunique())

new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)

train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)


print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]

#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()

params = {"objective": "binary",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
           "max_bin": 256,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }

x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in range(8):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))

    params['seed'] = s

    if cv_only:
        kf = kfold.split(X, train_label)

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            cv_train[validate] += bst.predict(X_validate)

            score = Gini(label_validate, cv_train[validate])
            print(score)
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print(Gini(train_label, cv_train))
        print("current score: {} {}".format(Gini(train_label, final_cv_train / (s + 1.)), s + 1))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)
# average over the 8 seeds run in the loop above (the divisor was 16, which did not match the loop)
pd.DataFrame({'id': test_id, 'target': final_cv_pred / 8.}).to_csv('../model/lgbm8_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 8.}).to_csv('../model/lgbm8_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16


================================================
FILE: code_for_exact_solution/logistic1.py
================================================
'''
logistic regression benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle
cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train_cat_count, test_cat_count = cat_count(train, test, cat_fea)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)


inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)

train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x not in cat_fea]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x not in cat_fea]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

# pickles are binary files, so open them in 'rb' mode
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk", "rb"))


train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train_fea0, train_cat_count]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test_fea0, test_cat_count]#, np.ones(shape=(test_num.shape[0], 1))]

#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)

X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
#all_data = np.vstack([X, X_test])

# keep the 75% of features with the highest ANOVA F-score
selector = SelectPercentile(f_classif, percentile=75)
selector.fit(X.toarray(), train_label)
X, X_test = selector.transform(X.toarray()), selector.transform(X_test.toarray())

all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)

X = csr_matrix(scaler.transform(X))
X_test = csr_matrix(scaler.transform(X_test))
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)

print(X.shape, X_test.shape)

kf = kfold.split(X, train_label)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
best_trees = []
fold_scores = []

if cv_only:
    for i, (train_fold, validate) in enumerate(kf):
        X_train, X_validate, label_train, label_validate = \
            X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]

        #selector = SelectPercentile(f_classif, X_train.toarray(), label_train)
        #X_train, X_validate = csr_matrix(selector.transform(X_train.toarray())), csr_matrix(selector.transform(
        #    X_validate.toarray()
        #))

        clf = LR(C=25.)
        clf.fit(X_train, label_train)
        cv_pred += clf.predict_proba(X_test)[:, 1]
        cv_train[validate] += clf.predict_proba(X_validate)[:, 1]
        score = Gini(label_validate, cv_train[validate])
        print(score)
        fold_scores.append(score)

    print("cv score:")
    print(Gini(train_label, cv_train))
    print(fold_scores)

    if save_cv:
        pd.DataFrame({'id': test_id, 'target': cv_pred/NFOLDS}).to_csv('../model/logistic1_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': cv_train}).to_csv('../model/logistic1_cv.csv', index=False)

#cv score:
#0.27906438918
#[0.28773918373071583, 0.26723600995806762, 0.28158062789785737, 0.27506435916773242, 0.28394519465855006]

if full_train:
    clf = LR(C=25.)
    clf.fit(X, train_label)
    pred = clf.predict_proba(X_test)[:, 1]
    pd.DataFrame({'id': test_id, 'target': pred}).to_csv('../model/logistic1_full_pred.csv', index=False)




================================================
FILE: code_for_exact_solution/rank_average.py
================================================
import pandas as pd
from util import Gini

def get_rank(x):
    return pd.Series(x).rank(pct=True).values
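
# Illustrative sketch only, not part of the original solution: a two-model
# version of the weighted rank blend computed further down in this script.
# The helper name `_demo_rank_blend` and its default weights are assumptions
# made for this example. Percentile ranks put models with different output
# scales (probabilities, raw margins, already-ranked files) on a common
# (0, 1] scale before they are combined.
import numpy as np
import pandas as pd


def _demo_rank_blend(pred_a, pred_b, w_a=0.5, w_b=0.5):
    """Blend two prediction vectors on their percentile ranks (hypothetical helper)."""
    rank_a = pd.Series(pred_a).rank(pct=True).values
    rank_b = pd.Series(pred_b).rank(pct=True).values
    return w_a * rank_a + w_b * rank_b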

train = pd.read_csv("../input/train.csv", usecols = ['target'])

keras3_train = pd.read_csv("../model/keras3_cv.csv")
keras5_train = pd.read_csv("../model/keras5_cv.csv")
keras6_train = pd.read_csv("../model/keras6_cv.csv")
keras7_train = pd.read_csv("../model/keras7_cv.csv")

lgbm1_train = pd.read_csv("../model/lgbm1_cv_avg.csv")
lgbm3_train = pd.read_csv("../model/lgbm3_cv_avg.csv")
lgbm8_train = pd.read_csv("../model/lgbm8_cv_avg.csv")
lgbm5_train = pd.read_csv("../model/lgbm5_cv_avg.csv")
lgbm6_train = pd.read_csv("../model/lgbm6_cv_avg.csv")
lgbm7_train = pd.read_csv("../model/lgbm7_cv_avg.csv")


logistic1_train = pd.read_csv("../model/logistic1_cv.csv")

xgb0_train = pd.read_csv("../model/xgb0_cv.csv")

keras3_test = pd.read_csv("../model/keras3_pred.csv")
keras5_test = pd.read_csv("../model/keras5_pred.csv")
keras6_test = pd.read_csv("../model/keras6_pred.csv")
keras7_test = pd.read_csv("../model/keras7_pred.csv")


lgbm1_test = pd.read_csv("../model/lgbm1_pred_avg.csv")
lgbm3_test = pd.read_csv("../model/lgbm3_pred_avg.csv")
lgbm8_test = pd.read_csv("../model/lgbm8_pred_avg.csv")
lgbm5_test = pd.read_csv("../model/lgbm5_pred_avg.csv")
lgbm6_test = pd.read_csv("../model/lgbm6_pred_avg.csv")
lgbm7_test = pd.read_csv("../model/lgbm7_pred_avg.csv")


logistic1_test = pd.read_csv("../model/logistic1_pred.csv")

xgb0_test = pd.read_csv("../model/xgb0_pred.csv")

xgblinear_train = pd.read_csv("../model/xgb0l_cv.csv")
xgblinear_test = pd.read_csv("../model/xgb0l_pred.csv")


result = get_rank(keras5_train['target']) * 0.4 + get_rank(lgbm3_train['target']) * 0.5 + \
         get_rank(xgb0_train['target']) * 0.1 + get_rank(lgbm1_train['target']) * (-0.1) + \
         get_rank(keras3_train['target']) * 0.1 + get_rank(logistic1_train['target']) * 0.1 + \
         get_rank(xgblinear_train['target']) * 0.1 + get_rank(lgbm8_train['target']) * 0.25 + \
         get_rank(lgbm5_train['target']) * 0.1 + \
         get_rank(lgbm6_train['target']) * (-0.1) + get_rank(lgbm7_train['target']) * (0.1) + \
         get_rank(keras6_train['target']) * (-0.1) + \
         get_rank(keras7_train['target']) * 0.3

print("cv of final averaged model: {}".format(Gini(train['target'], result)))

result = get_rank(keras5_test['target']) * 0.4 + get_rank(lgbm3_test['target']) * 0.5 + \
    get_rank(xgb0_test['target']) * 0.1 + get_rank(lgbm1_test['target']) * (-0.1) + \
    get_rank(keras3_test['target']) * 0.1 + get_rank(logistic1_test['target']) * 0.1 + \
    get_rank(xgblinear_test['target']) * 0.1 + get_rank(lgbm8_test['target']) * 0.25 + \
    get_rank(lgbm5_test['target']) * 0.1 + \
    get_rank(lgbm6_test['target']) * (-0.1) + get_rank(lgbm7_test['target']) * (0.1) + \
    get_rank(keras6_test['target']) * (-0.1) + \
    get_rank(keras7_test['target']) * 0.3

pd.DataFrame({'id': keras5_test['id'], 'target': get_rank(result)}).to_csv("../model/all_average.csv", index = False)

================================================
FILE: code_for_exact_solution/util.py
================================================
import numpy as np
import pandas as pd

def Gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]

    # sort rows on prediction column
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:, 0].argsort()][::-1, 0]
    pred_order = arr[arr[:, 1].argsort()][::-1, 0]

    # get Lorenz curves
    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
    # use float division: under Python 2, 1 / n_samples is integer division and evaluates to 0
    L_ones = np.linspace(1. / n_samples, 1, n_samples)

    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)

    # normalize to true Gini coefficient
    return G_pred * 1. / G_true
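

# Illustrative cross-check only, not part of the original solution. For binary
# labels and untied predictions, the normalized Gini is expected to agree with
# 2 * AUC - 1, so a tiny pair-counting version can be used to sanity-check
# Gini() on small samples. The helper name `_gini_via_pairs` is an assumption
# made for this example; it builds an (n_pos x n_neg) matrix, so it is only
# meant for toy inputs.
import numpy as np


def _gini_via_pairs(y_true, y_pred):
    """Normalized Gini for binary labels, computed as 2 * AUC - 1 by pair counting."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == 0]
    # compare every positive score with every negative score; ties count half
    diff = pos[:, None] - neg[None, :]
    auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / float(pos.size * neg.size)
    return 2. * auc - 1.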


def cat_count(train_df, test_df, cat_list):
    """
    :param train_df: train data frame
    :param test_df:  test data frame
    :param cat_list: names of categorical features
    :return: two arrays (train, test) holding one '<cat>_size' column per feature,
             counted over train and test combined
    """
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = pd.concat([train_df[['row_id', 'train'] + cat_list], test_df[['row_id', 'train'] + cat_list]])
    for e, cat in enumerate(cat_list):
        grouped = all_df[[cat]].groupby(cat)
        the_size = pd.DataFrame(grouped.size()).reset_index()
        the_size.columns = [cat, '{}_size'.format(cat)]
        all_df = pd.merge(all_df, the_size, how='left')

    # split back into train and test once, after all counts are merged, and restore row order
    selected_train = all_df[all_df['train'] == 1].sort_values('row_id')
    selected_test = all_df[all_df['train'] == 0].sort_values('row_id')
    selected_train = selected_train.drop(['row_id', 'train'] + cat_list, axis=1)
    selected_test = selected_test.drop(['row_id', 'train'] + cat_list, axis=1)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test


def proj_num_on_cat(train_df, test_df, target_column, group_column):
    """
    :param train_df: train data frame
    :param test_df:  test data frame
    :param target_column: name of numerical feature
    :param group_column: name of categorical feature
    """
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',
                                                                                        target_column, group_column]])
    grouped = all_df[[target_column, group_column]].groupby(group_column)
    the_size = pd.DataFrame(grouped.size()).reset_index()
    the_size.columns = [group_column, '%s_size' % target_column]
    the_mean = pd.DataFrame(grouped.mean()).reset_index()
    the_mean.columns = [group_column, '%s_mean' % target_column]
    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
    the_std.columns = [group_column, '%s_std' % target_column]
    the_median = pd.DataFrame(grouped.median()).reset_index()
    the_median.columns = [group_column, '%s_median' % target_column]
    the_stats = pd.merge(the_size, the_mean)
    the_stats = pd.merge(the_stats, the_std)
    the_stats = pd.merge(the_stats, the_median)

    the_max = pd.DataFrame(grouped.max()).reset_index()
    the_max.columns = [group_column, '%s_max' % target_column]
    the_min = pd.DataFrame(grouped.min()).reset_index()
    the_min.columns = [group_column, '%s_min' % target_column]

    the_stats = pd.merge(the_stats, the_max)
    the_stats = pd.merge(the_stats, the_min)

    all_df = pd.merge(all_df, the_stats, how='left')

    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test
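

# Illustrative sketch only, not part of the original solution: the same
# "project a numeric feature onto a grouping feature" idea as
# proj_num_on_cat(), written with a single groupby/agg and a row-wise lookup.
# The helper name `_demo_group_stats` is an assumption made for this example.
# Unlike proj_num_on_cat(), it does not add 'row_id' or 'train' columns to the
# input frames.
import numpy as np
import pandas as pd


def _demo_group_stats(train_df, test_df, target_column, group_column):
    """Per-group statistics of target_column over train and test combined (hypothetical helper)."""
    cols = [target_column, group_column]
    all_df = pd.concat([train_df[cols], test_df[cols]])
    stats = all_df.groupby(group_column)[target_column].agg(
        ['size', 'mean', 'std', 'median', 'max', 'min'])
    # groups with a single row have an undefined standard deviation
    stats['std'] = stats['std'].fillna(0)
    # look the statistics up row by row; the row order of the inputs is preserved
    train_stats = stats.reindex(train_df[group_column].values).values
    test_stats = stats.reindex(test_df[group_column].values).values
    return train_stats, test_stats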


def interaction_features(train, test, fea1, fea2, prefix):
    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]

    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]

    return train, test
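

# Illustrative sketch only, not part of the original solution: the ratio
# features built by interaction_features() become inf or NaN wherever the
# denominator is 0, which is why the model scripts clean the numeric block with
# .replace([np.inf, -np.inf, np.nan], 0) before stacking. The helper name
# `_demo_safe_ratio` is an assumption made for this example; it applies the
# same clean-up in two steps to a single ratio.
import numpy as np
import pandas as pd


def _demo_safe_ratio(df, fea1, fea2):
    """Ratio of two columns with inf and NaN results replaced by 0 (hypothetical helper)."""
    ratio = df[fea1] / df[fea2]
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)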


================================================
FILE: code_for_exact_solution/xgb0.py
================================================
'''
simple xgboost benchmark
'''
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse

cv_only = True
save_cv = True
full_train = False

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
#kfold = KFold(n_splits=NFOLDS, shuffle=True, random_state=0)
eta = 0.05
max_depth = 7
subsample = 0.97
colsample_bytree = 0.85
gamma = 0.05
alpha = 0
min_child_weight = 55
#lamb = 0.35
colsample_bylevel = 0.8
num_boost_round = 10000

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)

cat_fea = [x for x in list(train) if 'cat' in x]
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x not in cat_fea]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x not in cat_fea]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

train_list = [train_num, train_ohe]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num, test_ohe]#, np.ones(shape=(test_num.shape[0], 1))]

X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X, X_test = X.toarray(), X_test.toarray()
print(X.shape, X_test.shape)

final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
final_best_trees = []

params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "eta": eta,
          "max_depth": int(max_depth),
          "subsample": subsample,
          "colsample_bytree": colsample_bytree,
          "gamma": gamma,
          #"lamb": lamb,
          "alpha": alpha,
          "min_child_weight": min_child_weight,
          "colsample_bylevel": colsample_bylevel,
          "silent": 1
          }

if cv_only:
    num_seeds = 24
    for s in range(num_seeds):
        print(s)
        params['seed'] = s
        kf = kfold.split(X, train_label)
        cv_train = np.zeros(len(train_label))
        cv_pred = np.zeros(len(test_id))
        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=800,
                            early_stopping_rounds=25, maximize=True)
            best_trees.append(bst.best_iteration)
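            # predict test and validation rows with the same early-stopped number of trees
            cv_pred += bst.predict(xgb.DMatrix(X_test), ntree_limit=bst.best_ntree_limit)
            cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
            score = Gini(label_validate, cv_train[validate])
            print(score)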
            fold_scores.append(score)

        final_cv_train += cv_train
        final_cv_pred += cv_pred
        final_best_trees += best_trees
        print("cv score:")
        print(Gini(train_label, cv_train))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))
        print("current score:", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)


    final_cv_pred /= (NFOLDS * num_seeds)
    final_cv_train /= num_seeds
    pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb_avg16_pred.csv', index=False)
    pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb_avg16_cv.csv', index=False)
    print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))

    ## 0.1
    #0.281739276885
    #[0.28693135981084533, 0.26989064676756958, 0.28035898856108521, 0.28178381987103512, 0.29021910168396381]
    #([123, 91, 139, 97, 92], 108.40000000000001)


    #0.284350552387
    #([1057, 933, 1175, 979, 1168], 1062.4000000000001)

if full_train:
    for s in range(32):
        params['seed'] = s
        dtrain = xgb.DMatrix(X, train_label)
        watchlist = [(dtrain, 'train')]
        bst = xgb.train(params, dtrain, 100, evals=watchlist, feval=evalerror, verbose_eval=50, maximize=True)
        pred = bst.predict(xgb.DMatrix(X_test))
        if s == 0:
            final_pred = pred
        else:
            final_pred += pred
    pd.DataFrame({'id': test_id, 'target': final_pred / 32.}).to_csv('../model/xgb_avg_full_pred.csv', index=False)




================================================
FILE: code_for_exact_solution/xgb_linear0.py
================================================
'''
simple xgboost benchmark
'''
import os
import sys
import operator
import datetime
import pickle
from time import time
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import sparse
from scipy.sparse import csr_matrix
import xgboost as xgb
import category_encoders as ce
from sklearn import model_selection, preprocessing, ensemble
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression as LR
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from util import Gini, proj_num_on_cat, interaction_features, cat_count

cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)


#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)

train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

# open pickles in binary mode so loading also works outside Python 2
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk", "rb"))
train_fea1, test_fea1 = pickle.load(open("../input/fea0_lgb.pk", "rb"))

cat_count_features = []
for c in cat_fea:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)

train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]


t_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']
g_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat'] + cat_fea

t_fea = list(set(t_fea))
g_fea = list(set(g_fea))

#proj
for t in t_fea:
    for g in g_fea:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])

scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)

print(X.shape, X_test.shape)

final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
final_best_trees = []

eta = 0.1
lamb = 0.25
alpha = 1
num_boost_round = 10000

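# NOTE: xgboost's L2 regularization key is "lambda" (alias "reg_lambda");
# the "lamb" key below is kept unchanged so the script behaves as originally run.
params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "eta": eta,
          "lamb": lamb,
          "alpha": alpha,
          "silent": 1
          }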

if cv_only:
    num_seeds = 3
    for s in range(num_seeds):
        print(s)
        params['seed'] = s
        kf = kfold.split(X, train_label)
        cv_train = np.zeros(len(train_label))
        cv_pred = np.zeros(len(test_id))
        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=50, maximize=True)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(xgb.DMatrix(X_test))
            cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
            score = Gini(label_validate, cv_train[validate])
            print(score)
            fold_scores.append(score)

        final_cv_train += cv_train
        final_cv_pred += cv_pred
        final_best_trees += best_trees
        print("cv score:")
        print(Gini(train_label, cv_train))
        print(fold_scores)
        print(best_trees, np.mean(best_trees))
        print("current score:", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)


    final_cv_pred /= (NFOLDS * num_seeds)
    final_cv_train /= num_seeds
    pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb0l_pred.csv', index=False)
    pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb0l_cv.csv', index=False)
    print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))

================================================
FILE: input/readme
================================================
unzipped data goes here

================================================
FILE: model/readme
================================================
models will be saved here

================================================
FILE: readme.md
================================================
### Requirements

*Older or newer versions of the packages below should, in theory, work fine.*

python 2.7

numpy 1.13.3

pandas 0.20.3

sklearn 0.19.1

keras 2.1.1

tensorflow 1.4.0

xgboost 0.6

lightgbm 2.0.10


### How to reproduce

#### Simple solution (recommended)
Put unzipped data in `input`

*Generate a simple solution that is good enough for 2nd place (~0.2938 on private LB)*

`cd code`

`python fea_eng0.py`

`python nn_model290.py` to get an NN model that scores 0.290X

`python gbm_model291.py` to get a GBM model that scores 0.291X

`python simple_average.py` and then you can find the submission file at `../model/simple_average.csv`

You can reproduce this solution in a few hours.
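
As a rough illustration of the final blending step, the sketch below averages the percentile ranks of two prediction vectors. It assumes equal weights and borrows the `get_rank` helper as it appears in `rank_average.py`; `rank_average`, `nn_pred`, and `gbm_pred` are hypothetical names, and `simple_average.py` may differ in detail.

```python
import numpy as np
import pandas as pd


def get_rank(x):
    # Percentile rank in (0, 1]; tied values share their average rank.
    return pd.Series(x).rank(pct=True).values


def rank_average(pred_a, pred_b):
    # Assumption: equal-weight blend of the two models' percentile ranks.
    return 0.5 * (get_rank(pred_a) + get_rank(pred_b))


# Hypothetical predictions from two models on the same four rows.
nn_pred = np.array([0.02, 0.10, 0.05, 0.30])
gbm_pred = np.array([0.01, 0.04, 0.08, 0.20])
blend = rank_average(nn_pred, gbm_pred)
print(blend)
```

Averaging ranks rather than raw probabilities makes the blend insensitive to differences in how each model calibrates its outputs, which is harmless here because the leaderboard metric depends only on the ordering of predictions.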

#### Exact solution (Optional)

Although it is not recommended, you can also reproduce the exact solution we submitted (0.29413 on private LB).

*Follow the steps below in addition to the simple solution.*

```
cd ../code_for_exact_solution/
python keras3.py
python keras6.py
python keras7.py
python lightgbm1.py
python lightgbm5.py
python lightgbm6.py
python lightgbm7.py
python lightgbm8.py
python logistic1.py
python xgb0.py
python xgb_linear0.py
python rank_average.py
```

*It can take up to 2 days to generate the exact solution, which improves on the simple one by only 0.0003.*
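
The leaderboard scores quoted in this readme are normalized Gini coefficients, the competition metric. The sketch below shows one standard way to compute it, so that a 0.0003 difference can be put in context. It is not a copy of the `Gini` function in `util.py`; the names `gini` and `normalized_gini` are illustrative, and ties in the predictions are broken by row order here.

```python
import numpy as np


def gini(y_true, y_pred):
    # Sort labels by descending prediction (stable sort, ties keep row order).
    y_true = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(y_pred, dtype=float), kind="mergesort")
    # Cumulative share of positives captured as we walk down the ranking.
    cum_pos = np.cumsum(y_true[order]) / y_true.sum()
    n = len(y_true)
    # Area between that curve and the diagonal of a random ranking.
    return cum_pos.sum() / n - (n + 1) / (2.0 * n)


def normalized_gini(y_true, y_pred):
    # Divide by the Gini of a perfect ranking so the best score is 1.
    return gini(y_true, y_pred) / gini(y_true, y_true)


# Hypothetical labels and scores: three of four positive/negative pairs are ordered correctly.
print(normalized_gini([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

For binary labels without tied predictions this equals `2 * AUC - 1`, so a random ranking scores about 0 and a perfect one scores 1.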