Repository: xiaozhouwang/kaggle-porto-seguro
Branch: master
Commit: 83d794f6dce6
Files: 26
Total size: 158.5 KB
Directory structure:
gitextract_dcpwav_6/
├── Jupyter_nnmodel/
│ ├── README.md
│ ├── fea_eng0.py
│ ├── feature_generater.py
│ ├── nn_model .ipynb
│ └── util.py
├── code/
│ ├── fea_eng0.py
│ ├── gbm_model291.py
│ ├── nn_model290.py
│ ├── simple_average.py
│ └── util.py
├── code_for_exact_solution/
│ ├── keras3.py
│ ├── keras6.py
│ ├── keras7.py
│ ├── lightgbm1.py
│ ├── lightgbm5.py
│ ├── lightgbm6.py
│ ├── lightgbm7.py
│ ├── lightgbm8.py
│ ├── logistic1.py
│ ├── rank_average.py
│ ├── util.py
│ ├── xgb0.py
│ └── xgb_linear0.py
├── input/
│ └── readme
├── model/
│ └── readme
└── readme.md
================================================
FILE CONTENTS
================================================
================================================
FILE: Jupyter_nnmodel/README.md
================================================
# Content: Jupyter Version of 2nd place code kaggle-porto-seguro
## [Porto Seguro’s Safe Driver Prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) in Kaggle Competition
### Popose
The popose of this jupyter script is to make flow of code readable and easy to understand. Some code have been changed from the orignial author, but the concept of processing the 2nd place code is exactly same. For the some feature engineering processing, I put these codes into `feature_generater.py`
### Install
This project requires **Python 2.7 or 3.6** and the following Python libraries installed:
- [NumPy](http://www.numpy.org/)
- [Pandas](http://pandas.pydata.org)
- [Keras](http://matplotlib.org/)
- [scikit-learn](http://scikit-learn.org/stable/)
- [xgboost](https://xgboost.readthedocs.io/)
- [pickle](https://www.tensorflow.org/)
- [keras](https://keras.io/)
- [itertools](https://docs.python.org/2/library/itertools.html)
You will also need to have software installed to run and execute a [Jupyter Notebook](http://ipython.org/notebook.html)
If you do not have Python installed yet, it is highly recommended that you install the [Anaconda](http://continuum.io/downloads) distribution of Python, which already has the above packages and more included. Make sure that you select the Python 2.7 installer and not the Python 3.x installer.
### Code
1. Run `fea_eng0.py` to get your first features as a pickle file `fea0.pk`.
Note: if you are using python 3, you need to switch the code, `fea_eng0.py`, in the last line. So the pickle file would be able to read in `nn_model.ipynb`.
2. the manin code is provided in the `nn_model.ipynb` notebook file. You will also be required to use the included `util.py` and `feature_generater.py` Python files, the `train.csv` and `test.csv` dataset file,which you have to download from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) into the input folder, to complete your work. While some code has already been implemented to get you started, you will need to implement additional functionality when requested to successfully complete the. During the operation of `nn_model.ipynb` in , the defualt output will be created in model folder. If you are interested in `util.py` and `feature_generater.py`, please feel free to explore these Python files.
### Run
In a terminal or command window, navigate to the top-level project directory `Jupyter_Version/` (that contains this README) and run one of the following commands:
```bash
ipython notebook nn_model.ipynb
```
or
```bash
jupyter notebook nn_model.ipynb
```
This will open the Jupyter Notebook software and project file in your browser.
## Data
You can download data from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data)
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target columns signifies whether or not a claim was filed for that policy holder.
================================================
FILE: Jupyter_nnmodel/fea_eng0.py
================================================
"""xgb prediction as features"""
import xgboost as xgb
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd
eta = 0.1
max_depth = 6
subsample = 0.9
colsample_bytree = 0.85
min_child_weight = 55
num_boost_round = 500
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
params = {"objective": "reg:linear",
"booster": "gbtree",
"eta": eta,
"max_depth": int(max_depth),
"subsample": subsample,
"colsample_bytree": colsample_bytree,
"min_child_weight": min_child_weight,
"silent": 1
}
data = train.append(test)
data.reset_index(inplace=True)
train_rows = train.shape[0]
feature_results = []
for target_g in ['car', 'ind', 'reg']:
features = [x for x in list(data) if target_g not in x]
target_list = [x for x in list(data) if target_g in x]
train_fea = np.array(data[features])
for target in target_list:
print(target)
train_label = data[target]
kfold = KFold(n_splits=5, random_state=218, shuffle=True)
kf = kfold.split(data)
cv_train = np.zeros(shape=(data.shape[0], 1))
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]
dtrain = xgb.DMatrix(X_train, label_train)
dvalid = xgb.DMatrix(X_validate, label_validate)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,
early_stopping_rounds=10)
cv_train[validate, 0] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
feature_results.append(cv_train)
feature_results = np.hstack(feature_results)
train_features = feature_results[:train_rows, :]
test_features = feature_results[train_rows:, :]
import pickle
#for python 2
pickle.dump([train_features, test_features], open("../input/fea0.pk", 'wb'),protocol=2)
#for python 3
# pickle.dump([train_features, test_features], open("../input/fea0.pk", 'wb'),protocol=3)
================================================
FILE: Jupyter_nnmodel/feature_generater.py
================================================
from util import proj_num_on_cat, Gini, interaction_features
from itertools import combinations
import numpy as np
import pandas as pd
def Multiply_Divide(train, test, features):
"""
combinations:
combinations(['A', 'B','C'],2) retrun AB AC BC
combinations(range(4), 3) --> 012 013 023 123
"""
feature_names= []
for e, (x, y) in enumerate(combinations(features, 2)):
train, test, feature_name= interaction_features(train, test, x, y, e)
for name in feature_name:
feature_names.append(name)
return train, test, feature_names
def Series_string(train, test, category_list):
'''
produce series as a string like new_ind as the following
id new_ind_count new_ind
595207 117 3_1_10_0_0_0_0_0_1_0_0_0_0_0_13_1_0_0
595208 153 5_1_3_0_0_0_0_0_1_0_0_0_0_0_6_1_0_0
return train and test with new colunes of new_categories
'''
for category in category_list:
feature_names = list(train.columns)
features = [c for c in feature_names if category in c]
name= 'new_'+ category
count = 0
for c in features:
if count == 0:
train[name] = train[c].astype(str)
count += 1
else:
train[name] += '_' + train[c].astype(str)
count = 0
for c in features:
if count == 0:
test[name] = test[c].astype(str)
count += 1
else:
test[name] += '_' + test[c].astype(str)
return train, test
def Features_Counts(train, test, features):
feature_names =[]
for c in features:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
feature_names.append('%s_count'%c)
return train, test, feature_names
def Statistic_features(train, test, target_features, group_features):
train_list_=[]
test_list_=[]
for t in target_features:
for g in group_features:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list_.append(s_train)
test_list_.append(s_test)
return np.hstack(train_list_), np.hstack(test_list_)
def features_type(train):
data = []
for f in train.columns:
# Defining the level
if 'bin' in f or f == 'target':
level = 'binary'
elif 'cat' in f or f == 'id':
level = 'nominal'
elif train[f].dtype == float:
level = 'interval'
elif train[f].dtype == int:
level = 'ordinal'
# Initialize keep to True for all variables except for id
keep = True
if f == 'id':
keep = False
# Defining the data type
dtype = train[f].dtype
# Creating a Dict that contains all the metadata for the variable
f_dict = {
'varname': f,
'level': level,
'keep': keep,
'dtype': dtype
}
data.append(f_dict)
meta = pd.DataFrame(data, columns=['varname', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)
interval = meta[(meta.level == 'interval') & (meta.keep)].index
ordinal = meta[(meta.level == 'ordinal') & (meta.keep)].index
binary = meta[(meta.level == 'binary') & (meta.keep)].index
nominal = meta[(meta.level == 'nominal') & (meta.keep)].index
return interval, ordinal, binary, nominal
================================================
FILE: Jupyter_nnmodel/nn_model .ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/stevenhu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
" from ._conv import register_converters as _register_converters\n",
"Using TensorFlow backend.\n"
]
}
],
"source": [
"#homemade script\n",
"from util import Gini\n",
"from feature_generater import Multiply_Divide, Series_string, Features_Counts, Statistic_features\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pickle\n",
"from scipy import sparse\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"#for NN model\n",
"from keras.layers import Dense, Dropout, Embedding, Flatten, Input, Concatenate, merge\n",
"from keras.layers.normalization import BatchNormalization\n",
"from keras.layers.advanced_activations import PReLU\n",
"from keras.models import Model\n",
"from time import time\n",
"import datetime\n",
"from sklearn.model_selection import StratifiedKFold\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Load Data #\n",
"\n",
"- Create train, test dataset\n",
"- Create train target label\n",
"- Create feature object: cat, num, bin, inter\n",
"- Create feature columns in train: counting of miss values\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"cv_only = True\n",
"save_cv = True\n",
"\n",
"#read data\n",
"train = pd.read_csv(\"../input/train.csv\")\n",
"train_label = train['target']\n",
"train_id = train['id']\n",
"del train['target'], train['id']\n",
"\n",
"test = pd.read_csv(\"../input/test.csv\")\n",
"test_id = test['id']\n",
"del test['id']\n",
"\n",
"\n",
"\n",
"#find missing value by each row and recode to column 'missing'\n",
"train['missing'] = (train==-1).sum(axis=1).astype(float)\n",
"test['missing'] = (test==-1).sum(axis=1).astype(float)\n",
"\n",
"#get all featrue name\n",
"feature_names = list(train)\n",
"\n",
"# extract feature with cat, bin, num, inter\n",
"cat_fea = [x for x in list(train) if 'cat' in x]\n",
"bin_fea = [x for x in list(train) if 'bin' in x]\n",
"num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\n",
"inter_fea = [x for x in list(train) if 'inter' in x]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Feature Engineering #\n",
"\n",
"- Create Multipy and Divide feature\n",
"- Feature of Counts(target incoding)\n",
"- Load feature generated from Feature Engine\n",
"- Create Statistic features\n",
"- Combine all feature together, and get ready for training\n",
"- Create Cat_feature for NN embeding training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.1 Multipy and Divide feature ##\n",
"- moltipy each feature in the list and created new columns into train and testing data set"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"#Add features of Multiply and Divide\n",
"features= ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\n",
"train, test, MD_features = Multiply_Divide(train, test, features)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>inter_0*</th>\n",
" <th>inter_0/</th>\n",
" <th>inter_1*</th>\n",
" <th>inter_1/</th>\n",
" <th>inter_2*</th>\n",
" <th>inter_2/</th>\n",
" <th>inter_3*</th>\n",
" <th>inter_3/</th>\n",
" <th>inter_4*</th>\n",
" <th>inter_4/</th>\n",
" <th>...</th>\n",
" <th>inter_10*</th>\n",
" <th>inter_10/</th>\n",
" <th>inter_11*</th>\n",
" <th>inter_11/</th>\n",
" <th>inter_12*</th>\n",
" <th>inter_12/</th>\n",
" <th>inter_13*</th>\n",
" <th>inter_13/</th>\n",
" <th>inter_14*</th>\n",
" <th>inter_14/</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4.418395</td>\n",
" <td>0.176736</td>\n",
" <td>0.634544</td>\n",
" <td>1.230630</td>\n",
" <td>9.720468</td>\n",
" <td>0.080334</td>\n",
" <td>0.618575</td>\n",
" <td>1.262398</td>\n",
" <td>1.767358</td>\n",
" <td>0.441839</td>\n",
" <td>...</td>\n",
" <td>0.502649</td>\n",
" <td>1.025815</td>\n",
" <td>1.436141</td>\n",
" <td>0.359035</td>\n",
" <td>7.7</td>\n",
" <td>15.714286</td>\n",
" <td>22</td>\n",
" <td>5.500000</td>\n",
" <td>1.4</td>\n",
" <td>0.350000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.331716</td>\n",
" <td>0.088402</td>\n",
" <td>0.474062</td>\n",
" <td>0.807773</td>\n",
" <td>1.856450</td>\n",
" <td>0.206272</td>\n",
" <td>0.495053</td>\n",
" <td>0.773521</td>\n",
" <td>0.618817</td>\n",
" <td>0.618817</td>\n",
" <td>...</td>\n",
" <td>0.612862</td>\n",
" <td>0.957597</td>\n",
" <td>0.766078</td>\n",
" <td>0.766078</td>\n",
" <td>2.4</td>\n",
" <td>3.750000</td>\n",
" <td>3</td>\n",
" <td>3.000000</td>\n",
" <td>0.8</td>\n",
" <td>0.800000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5.774271</td>\n",
" <td>0.071287</td>\n",
" <td>-0.641586</td>\n",
" <td>-0.641586</td>\n",
" <td>7.699029</td>\n",
" <td>0.053465</td>\n",
" <td>0.000000</td>\n",
" <td>inf</td>\n",
" <td>3.207929</td>\n",
" <td>0.128317</td>\n",
" <td>...</td>\n",
" <td>-0.000000</td>\n",
" <td>-inf</td>\n",
" <td>-5.000000</td>\n",
" <td>-0.200000</td>\n",
" <td>0.0</td>\n",
" <td>inf</td>\n",
" <td>60</td>\n",
" <td>2.400000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.085898</td>\n",
" <td>0.271474</td>\n",
" <td>0.315425</td>\n",
" <td>0.934592</td>\n",
" <td>4.343590</td>\n",
" <td>0.067869</td>\n",
" <td>0.488654</td>\n",
" <td>0.603276</td>\n",
" <td>0.000000</td>\n",
" <td>inf</td>\n",
" <td>...</td>\n",
" <td>0.522853</td>\n",
" <td>0.645497</td>\n",
" <td>0.000000</td>\n",
" <td>inf</td>\n",
" <td>7.2</td>\n",
" <td>8.888889</td>\n",
" <td>0</td>\n",
" <td>inf</td>\n",
" <td>0.0</td>\n",
" <td>inf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.000000</td>\n",
" <td>inf</td>\n",
" <td>0.475728</td>\n",
" <td>0.673001</td>\n",
" <td>5.092484</td>\n",
" <td>0.062870</td>\n",
" <td>0.396082</td>\n",
" <td>0.808331</td>\n",
" <td>0.000000</td>\n",
" <td>inf</td>\n",
" <td>...</td>\n",
" <td>0.588531</td>\n",
" <td>1.201084</td>\n",
" <td>0.000000</td>\n",
" <td>inf</td>\n",
" <td>6.3</td>\n",
" <td>12.857143</td>\n",
" <td>0</td>\n",
" <td>inf</td>\n",
" <td>0.0</td>\n",
" <td>inf</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 30 columns</p>\n",
"</div>"
],
"text/plain": [
" inter_0* inter_0/ inter_1* inter_1/ inter_2* inter_2/ inter_3* \\\n",
"0 4.418395 0.176736 0.634544 1.230630 9.720468 0.080334 0.618575 \n",
"1 4.331716 0.088402 0.474062 0.807773 1.856450 0.206272 0.495053 \n",
"2 5.774271 0.071287 -0.641586 -0.641586 7.699029 0.053465 0.000000 \n",
"3 1.085898 0.271474 0.315425 0.934592 4.343590 0.067869 0.488654 \n",
"4 0.000000 inf 0.475728 0.673001 5.092484 0.062870 0.396082 \n",
"\n",
" inter_3/ inter_4* inter_4/ ... inter_10* inter_10/ inter_11* \\\n",
"0 1.262398 1.767358 0.441839 ... 0.502649 1.025815 1.436141 \n",
"1 0.773521 0.618817 0.618817 ... 0.612862 0.957597 0.766078 \n",
"2 inf 3.207929 0.128317 ... -0.000000 -inf -5.000000 \n",
"3 0.603276 0.000000 inf ... 0.522853 0.645497 0.000000 \n",
"4 0.808331 0.000000 inf ... 0.588531 1.201084 0.000000 \n",
"\n",
" inter_11/ inter_12* inter_12/ inter_13* inter_13/ inter_14* inter_14/ \n",
"0 0.359035 7.7 15.714286 22 5.500000 1.4 0.350000 \n",
"1 0.766078 2.4 3.750000 3 3.000000 0.8 0.800000 \n",
"2 -0.200000 0.0 inf 60 2.400000 0.0 0.000000 \n",
"3 inf 7.2 8.888889 0 inf 0.0 inf \n",
"4 inf 6.3 12.857143 0 inf 0.0 inf \n",
"\n",
"[5 rows x 30 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(train[MD_features].head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 Feature of Counts ##\n",
"1. Generate new_ind, new_reg, new_car\n",
"2. Count the number of distinct values of\n",
" cat features, new_ind, new_reg and new_car"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"\n",
"'''\n",
"create 1_0_1_1..... data as new_xxx\n",
"\n",
"new_ind: collect all data from all relative \"ind\" columns, then generate series number\n",
"\n",
"new_reg, new_car for train and test data \n",
"For RNN processing, generating a sequence number\n",
"'''\n",
"\n",
"\n",
"category_list = ['ind', 'reg', 'car']\n",
"#add 'new_ind','new_reg','new_car' in train and test dataset\n",
"train, test = Series_string(train,test,category_list )\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>new_ind</th>\n",
" <th>new_reg</th>\n",
" <th>new_car</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0</td>\n",
" <td>0.7_0.2_0.7180703308</td>\n",
" <td>10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0....</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1</td>\n",
" <td>0.8_0.4_0.7660776723</td>\n",
" <td>11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0</td>\n",
" <td>0.0_0.0_-1.0</td>\n",
" <td>7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0</td>\n",
" <td>0.9_0.2_0.5809475019</td>\n",
" <td>7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0</td>\n",
" <td>0.7_0.6_0.840758586</td>\n",
" <td>11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" new_ind new_reg \\\n",
"0 2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0 0.7_0.2_0.7180703308 \n",
"1 1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1 0.8_0.4_0.7660776723 \n",
"2 5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0 0.0_0.0_-1.0 \n",
"3 0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0 0.9_0.2_0.5809475019 \n",
"4 0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0 0.7_0.6_0.840758586 \n",
"\n",
" new_car \n",
"0 10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0.... \n",
"1 11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618... \n",
"2 7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415... \n",
"3 7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429... \n",
"4 11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56... "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(train[['new_ind','new_reg','new_car']].head(5))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"'''\n",
"count_features\n",
"\n",
"preparing for train[cat_count_features] \n",
"cat_fea = \n",
"['ps_ind_02_cat','ps_ind_04_cat','ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat',\n",
" 'ps_car_03_cat','ps_car_04_cat','ps_car_05_cat','ps_car_06_cat','ps_car_07_cat',\n",
" 'ps_car_08_cat','ps_car_09_cat','ps_car_10_cat', 'ps_car_11_cat']\n",
"\n",
"Example: \n",
"ps_ind_02_cat_count\n",
"dictionay of ps_ind_02_cat \n",
"([(1, 1079327), (2, 309747), (3, 70172), (4, 28259), (-1, 523)])\n",
"\n",
"row count origial value\n",
"595202 1079327 1 \n",
"595203 309747 2\n",
"595204 309747 2\n",
"595205 70172 3\n",
"595206 1079327 1\n",
"\n",
"''' \n",
"\n",
"cat_fea = [ name for name in list(train) if 'cat' in name and 'count' not in name]\n",
"features= cat_fea + ['new_ind','new_reg','new_car']\n",
"\n",
"train, test, cat_count_features= Features_Counts(train, test, features)\n",
"\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ps_ind_02_cat_count</th>\n",
" <th>ps_ind_04_cat_count</th>\n",
" <th>ps_ind_05_cat_count</th>\n",
" <th>ps_car_01_cat_count</th>\n",
" <th>ps_car_02_cat_count</th>\n",
" <th>ps_car_03_cat_count</th>\n",
" <th>ps_car_04_cat_count</th>\n",
" <th>ps_car_05_cat_count</th>\n",
" <th>ps_car_06_cat_count</th>\n",
" <th>ps_car_07_cat_count</th>\n",
" <th>ps_car_08_cat_count</th>\n",
" <th>ps_car_09_cat_count</th>\n",
" <th>ps_car_10_cat_count</th>\n",
" <th>ps_car_11_cat_count</th>\n",
" <th>new_ind_count</th>\n",
" <th>new_reg_count</th>\n",
" <th>new_car_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>309747</td>\n",
" <td>620936</td>\n",
" <td>1319412</td>\n",
" <td>124587</td>\n",
" <td>1234979</td>\n",
" <td>1028142</td>\n",
" <td>1241334</td>\n",
" <td>431560</td>\n",
" <td>77845</td>\n",
" <td>1383070</td>\n",
" <td>249663</td>\n",
" <td>486510</td>\n",
" <td>1475460</td>\n",
" <td>18326</td>\n",
" <td>6</td>\n",
" <td>24</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1079327</td>\n",
" <td>866864</td>\n",
" <td>1319412</td>\n",
" <td>518725</td>\n",
" <td>1234979</td>\n",
" <td>1028142</td>\n",
" <td>1241334</td>\n",
" <td>666910</td>\n",
" <td>329890</td>\n",
" <td>1383070</td>\n",
" <td>1238365</td>\n",
" <td>883326</td>\n",
" <td>1475460</td>\n",
" <td>12535</td>\n",
" <td>36</td>\n",
" <td>38</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>28259</td>\n",
" <td>620936</td>\n",
" <td>1319412</td>\n",
" <td>449617</td>\n",
" <td>1234979</td>\n",
" <td>1028142</td>\n",
" <td>1241334</td>\n",
" <td>666910</td>\n",
" <td>147714</td>\n",
" <td>1383070</td>\n",
" <td>1238365</td>\n",
" <td>883326</td>\n",
" <td>1475460</td>\n",
" <td>19943</td>\n",
" <td>24</td>\n",
" <td>13477</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1079327</td>\n",
" <td>866864</td>\n",
" <td>1319412</td>\n",
" <td>449617</td>\n",
" <td>1234979</td>\n",
" <td>183044</td>\n",
" <td>1241334</td>\n",
" <td>431560</td>\n",
" <td>329890</td>\n",
" <td>1383070</td>\n",
" <td>1238365</td>\n",
" <td>36798</td>\n",
" <td>1475460</td>\n",
" <td>212989</td>\n",
" <td>2784</td>\n",
" <td>222</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>309747</td>\n",
" <td>620936</td>\n",
" <td>1319412</td>\n",
" <td>518725</td>\n",
" <td>1234979</td>\n",
" <td>1028142</td>\n",
" <td>1241334</td>\n",
" <td>666910</td>\n",
" <td>147714</td>\n",
" <td>1383070</td>\n",
" <td>1238365</td>\n",
" <td>883326</td>\n",
" <td>1475460</td>\n",
" <td>26161</td>\n",
" <td>258</td>\n",
" <td>34</td>\n",
" <td>13</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ps_ind_02_cat_count ps_ind_04_cat_count ps_ind_05_cat_count \\\n",
"0 309747 620936 1319412 \n",
"1 1079327 866864 1319412 \n",
"2 28259 620936 1319412 \n",
"3 1079327 866864 1319412 \n",
"4 309747 620936 1319412 \n",
"\n",
" ps_car_01_cat_count ps_car_02_cat_count ps_car_03_cat_count \\\n",
"0 124587 1234979 1028142 \n",
"1 518725 1234979 1028142 \n",
"2 449617 1234979 1028142 \n",
"3 449617 1234979 183044 \n",
"4 518725 1234979 1028142 \n",
"\n",
" ps_car_04_cat_count ps_car_05_cat_count ps_car_06_cat_count \\\n",
"0 1241334 431560 77845 \n",
"1 1241334 666910 329890 \n",
"2 1241334 666910 147714 \n",
"3 1241334 431560 329890 \n",
"4 1241334 666910 147714 \n",
"\n",
" ps_car_07_cat_count ps_car_08_cat_count ps_car_09_cat_count \\\n",
"0 1383070 249663 486510 \n",
"1 1383070 1238365 883326 \n",
"2 1383070 1238365 883326 \n",
"3 1383070 1238365 36798 \n",
"4 1383070 1238365 883326 \n",
"\n",
" ps_car_10_cat_count ps_car_11_cat_count new_ind_count new_reg_count \\\n",
"0 1475460 18326 6 24 \n",
"1 1475460 12535 36 38 \n",
"2 1475460 19943 24 13477 \n",
"3 1475460 212989 2784 222 \n",
"4 1475460 26161 258 34 \n",
"\n",
" new_car_count \n",
"0 1 \n",
"1 11 \n",
"2 40 \n",
"3 1 \n",
"4 13 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# display(train[['new_ind','new_reg','new_car']].head(5))\n",
"display(train[cat_count_features].head(5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 Get the feature from feature training ## "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"train_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\",'rb'), encoding='iso-8859-1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.4 Statistic features ##\n",
"\n",
"- find the feature of median, mean and standard deviation"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"#feature aggregation\n",
"target_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\n",
"group_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']\n",
"\n",
"#return numpy because we need to do np.hstack to merge all statistic feature together, so that it would return np array\n",
"train_statis, test_statis = Statistic_features(train, test, target_features, group_features)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1.57043000e+05, 8.32786207e-01, 2.41530046e-01, ...,\n",
" 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
" [ 1.30452000e+05, 8.26528390e-01, 2.35133348e-01, ...,\n",
" 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
" [ 6.35510000e+04, 8.13168936e-01, 2.35946815e-01, ...,\n",
" 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
" ..., \n",
" [ 3.58630000e+04, 8.00633360e-01, 2.34463222e-01, ...,\n",
" 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
" [ 2.04836000e+05, 8.24270444e-01, 2.28649975e-01, ...,\n",
" 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
" [ 9.85280000e+04, 8.24453229e-01, 2.37806003e-01, ...,\n",
" 1.00000000e+00, 7.00000000e+00, 0.00000000e+00]])"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(train_statis)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.5 Combine all feature together, and get ready for training ##\n",
"\n",
"1. merge features into train_list & test_list that would like to dump into NN model\n",
"2. training a scaler by sparse that generated by train_list & test_list\n",
"4. convert train_list & test_list into X , X_test, which has been scaled"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"'''\n",
"Building a train list including train_num, cat_count_features, statistic feature and infered feature training by XGboost.\n",
"\n",
"train_num: training data without set of cat_calc\n",
"cat_count_features: cat_fea + ['new_ind','new_reg','new_car']\n",
"train_fea0: feature extraction \n",
"'''\n",
"\n",
"\n",
"#training data without set of cat_calc\n",
"train_num = train[[x for x in list(train) if x in num_features]]\n",
"test_num = test[[x for x in list(train) if x in num_features]]\n",
"\n",
"train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_statis, train_fea0 ]\n",
"test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_statis,test_fea0] "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"'''\n",
"X are stacked from 5 features\n",
"1. train_num(595212,54): training data without set of cat_calc\n",
"2. cat_count_features(595212,17): cat_fea + ['new_ind','new_reg','new_car']\n",
"3. feature statis(595212,6) * 36\n",
"4. train_fea0(595212, 38): feature extraction\n",
"\n",
"all_data (595212, 235)\n",
"'''\n",
"\n",
"\n",
"X = sparse.hstack(train_list).tocsr()\n",
"X_test = sparse.hstack(test_list).tocsr()\n",
"\n",
"all_data = np.vstack([X.toarray(), X_test.toarray()])\n",
"scaler = StandardScaler()\n",
"scaler.fit(all_data)\n",
"X = scaler.transform(X.toarray())\n",
"X_test = scaler.transform(X_test.toarray())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.6 Create Cat_feature for NN embeding training ##\n",
" Don't ask why they doing this! they only tell you what is this\n",
" \n",
" - in the feature NN model of the finial testing data, you would need the list, which likes **[[cat_featrue], X]**\n",
" or **[['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'], X]**\n",
" \n",
" **_This is to process the above testing data. If you could not understand, that is fine, and just look the next steps_**"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/pandas/core/indexing.py:630: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" self.obj[item_labels[indexer[info_axis]]] = value\n"
]
}
],
"source": [
"#preparing for training cat \n",
"train_cat = train[cat_fea]\n",
"test_cat = test[cat_fea]\n",
"\n",
"# convert pd to np.array\n",
"X_cat = train_cat.values\n",
"tem = test_cat.values\n",
"\n",
"# storing the dimension for embedding layer as an input value\n",
"max_cat_values = []\n",
"\n",
"for c in cat_fea:\n",
" \n",
" #nomalize the label\n",
" #LabelEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html\n",
" \n",
" le = LabelEncoder()\n",
" x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\n",
" train_cat.loc[:,c] = le.transform(train_cat[c])\n",
" test_cat.loc[:,c] = le.transform(test_cat[c])\n",
" max_cat_values.append(np.max(x))\n",
"\n",
"# Build the final testing data\n",
"X_TEST_CAT = []\n",
"for i in range(tem.shape[1]):\n",
" X_TEST_CAT.append(tem[:, i].reshape(-1, 1))\n",
"X_TEST_CAT.append(X_test)\n"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cat_fea: ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat']\n",
"\n",
"max_cat_values: [4, 2, 7, 12, 2, 2, 9, 2, 17, 2, 1, 5, 2, 103]\n"
]
}
],
"source": [
"print('cat_fea:', cat_fea)\n",
"print('\\nmax_cat_values: ',max_cat_values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Training NN Model with Keras # "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Build the model\n",
"2. training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Build the model ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### model structure: ###\n",
"<img src=\"Jupyter_image/NN_layer.png\">"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"def nn_model():\n",
" inputs = []\n",
" flatten_layers = []\n",
" for e, c in enumerate(cat_fea):\n",
" input_c = Input(shape=(1, ), dtype='int32')\n",
" num_c = max_cat_values[e]\n",
" \n",
" # need to add 1, https://keras.io/layers/embeddings/\n",
" # **input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.**\n",
" embed_c = Embedding(num_c+1,6,input_length=1)(input_c)\n",
" embed_c = Dropout(0.25)(embed_c)\n",
" flatten_c = Flatten()(embed_c)\n",
" inputs.append(input_c)\n",
" flatten_layers.append(flatten_c)\n",
" \n",
" \n",
" input_num = Input(shape=(X.shape[1],), dtype='float32')\n",
" inputs.append(input_num)\n",
" \n",
" #merge X and embedding layer\n",
" flatten_layers.append(input_num)\n",
" flatten = merge(flatten_layers, mode='concat')\n",
"\n",
" fc1 = Dense(512, kernel_initializer='he_normal')(flatten)\n",
" fc1 = PReLU()(fc1)\n",
" fc1 = BatchNormalization()(fc1)\n",
" fc1 = Dropout(0.75)(fc1)\n",
"\n",
" fc1 = Dense(64, kernel_initializer='he_normal')(fc1)\n",
" fc1 = PReLU()(fc1)\n",
" fc1 = BatchNormalization()(fc1)\n",
" fc1 = Dropout(0.5)(fc1)\n",
"\n",
" outputs = Dense(1, kernel_initializer='he_normal', activation='sigmoid')(fc1)\n",
"\n",
" model = Model(inputs = inputs, outputs = outputs)\n",
" model.compile(loss='binary_crossentropy', optimizer='adam')\n",
" return (model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Start to Train ##"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/ipykernel_launcher.py:22: UserWarning: The `merge` function is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\n",
"/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/keras/legacy/layers.py:465: UserWarning: The `Merge` layer is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\n",
" name=name)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train on 297606 samples, validate on 297606 samples\n",
"Epoch 1/1\n",
" - 27s - loss: 0.3061 - val_loss: 0.1639\n",
"local fold Gini: 0.209322322663\n",
"Train on 297606 samples, validate on 297606 samples\n",
"Epoch 1/1\n",
" - 28s - loss: 0.3087 - val_loss: 0.1645\n",
"local fold Gini: 0.201585256464\n",
"seed 0: Gini 0.20545379936910999\n",
"Total training time: 0:01:55.265348\n"
]
}
],
"source": [
"\"\"\"\n",
"#validation fold\n",
"NFOLDS = 5\n",
"kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n",
"\n",
"I change \"test\" to \"vaild\" because I feel it is clear to understand\n",
"\"\"\"\n",
"\n",
"cv_train = np.zeros(len(train_label))\n",
"cv_pred = np.zeros(len(test_id))\n",
"\n",
"#validation fold\n",
"NFOLDS = 5\n",
"kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n",
"\n",
"#with different random see make result stable.\n",
"num_seeds = 5\n",
"begintime = time()\n",
"if cv_only:\n",
" for s in range(num_seeds):\n",
" np.random.seed(s)\n",
" for (train_index, valid_index) in kfold.split(X, train_label):\n",
" \n",
" #assign data from training data and labels to validation data; \n",
" x_train = X[train_index]\n",
" y_train = train_label[train_index]\n",
" x_valid= X[valid_index]\n",
" y_valid = train_label[valid_index]\n",
" \n",
" # assign X_cat to validation data; \n",
" x_train_cat = X_cat[train_index]\n",
" x_valid_cat = X_cat[valid_index]\n",
"\n",
" #Package data for training, the package(list) is [[cat_featrues], x_train] \n",
" # or [ ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'] ,x_train]\n",
" \n",
" x_train_cat_list, x_valid_cat_list = [], []\n",
" for i in range(x_train_cat.shape[1]):\n",
" x_train_cat_list.append(x_train_cat[:, i].reshape(-1, 1))\n",
" x_valid_cat_list.append(x_valid_cat[:, i].reshape(-1, 1))\n",
"\n",
" x_train_cat_list.append(x_train)\n",
" x_valid_cat_list.append(x_valid)\n",
" \n",
" #load model\n",
" model = nn_model()\n",
" \n",
" def get_rank(x):\n",
" return pd.Series(x).rank(pct=True).values\n",
" #fit model. Note: Change epochs to make prediction accuracy\n",
" model.fit(x_train_cat_list, y_train, epochs=10, batch_size=512, verbose=2, validation_data=[x_valid_cat_list, y_valid])\n",
" \n",
" #record prediction with validation data\n",
" cv_train[valid_index] += get_rank(model.predict(x=x_valid_cat_list, batch_size=512, verbose=0)[:, 0])\n",
" print('local fold Gini: ',Gini(train_label[valid_index], cv_train[valid_index]))\n",
" \n",
" #recode prediction with testing data\n",
" cv_pred += get_rank(model.predict(x=X_TEST_CAT, batch_size=512, verbose=0)[:, 0])\n",
" \n",
" \n",
" \n",
" print(\"seed {0}: Gini {1}\".format(s,Gini(train_label, cv_train / (1. * (s + 1)))))\n",
" print(\"Total training time: \",str(datetime.timedelta(seconds=time() - begintime)))\n",
" if save_cv:\n",
" \n",
" #divid (NFOLDS * num_seeds) to get average of probablity \n",
" pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)\n",
" pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: Jupyter_nnmodel/util.py
================================================
import numpy as np
import pandas as pd
def Gini(y_true, y_pred):
# check and get number of samples
assert y_true.shape == y_pred.shape
n_samples = y_true.shape[0]
# sort rows on prediction column
# (from largest to smallest)
arr = np.array([y_true, y_pred]).transpose()
true_order = arr[arr[:, 0].argsort()][::-1, 0]
pred_order = arr[arr[:, 1].argsort()][::-1, 0]
# get Lorenz curves
L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
L_ones = np.linspace(1 / n_samples, 1, n_samples)
# get Gini coefficients (area between curves)
G_true = np.sum(L_ones - L_true)
G_pred = np.sum(L_ones - L_pred)
# normalize to true Gini coefficient
return G_pred * 1. / G_true
def cat_count(train_df, test_df, cat_list):
train_df['row_id'] = range(train_df.shape[0])
test_df['row_id'] = range(test_df.shape[0])
train_df['train'] = 1
test_df['train'] = 0
all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])
for e, cat in enumerate(cat_list):
grouped = all_df[[cat]].groupby(cat)
the_size = pd.DataFrame(grouped.size()).reset_index()
the_size.columns = [cat, '{}_size'.format(cat)]
all_df = pd.merge(all_df, the_size, how='left')
selected_train = all_df[all_df['train'] == 1]
selected_test = all_df[all_df['train'] == 0]
selected_train.sort_values('row_id', inplace=True)
selected_test.sort_values('row_id', inplace=True)
selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
selected_train, selected_test = np.array(selected_train), np.array(selected_test)
print(selected_train.shape, selected_test.shape)
return selected_train, selected_test
def proj_num_on_cat(train_df, test_df, target_column, group_column):
"""
:param train_df: train data frame
:param test_df: test data frame
:param target_column: name of numerical feature
:param group_column: name of categorical feature
"""
train_df['row_id'] = range(train_df.shape[0]) # 595211 create index for each row
test_df['row_id'] = range(test_df.shape[0])
train_df['train'] = 1
test_df['train'] = 0
all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',
target_column, group_column]]).copy()
#count the number
grouped = all_df[[target_column, group_column]].groupby(group_column)
#count the number of distint value from the list [1,1, 2,3]
#[1,2,3] so answer is 3
#count the number of each distint value [1,1,2,3]
#1:2
#2:1
#3:1
#count the number of each distint value
the_size = pd.DataFrame(grouped.size()).reset_index()
the_size.columns = [group_column, '%s_size' % target_column] #rename columns name
#find the mean, std, median, max, min of each distint value
the_mean = pd.DataFrame(grouped.mean()).reset_index()
the_mean.columns = [group_column, '%s_mean' % target_column] #rename columns name
the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
the_std.columns = [group_column, '%s_std' % target_column]
the_median = pd.DataFrame(grouped.median()).reset_index()
the_median.columns = [group_column, '%s_median' % target_column]
the_max = pd.DataFrame(grouped.max()).reset_index()
the_max.columns = [group_column, '%s_max' % target_column]
the_min = pd.DataFrame(grouped.min()).reset_index()
the_min.columns = [group_column, '%s_min' % target_column]
#merge them
the_stats=pd.concat([the_size,the_mean.iloc[:,1],the_std.iloc[:,1]
,the_median.iloc[:,1] ,the_max.iloc[:,1],the_min.iloc[:,1]]
,axis=1, join_axes=[the_size.index])
#insert value to the original data
all_df = pd.merge(all_df, the_stats, how='left')
#splite to train and test
selected_train = all_df[all_df['train'] == 1].copy()
selected_test = all_df[all_df['train'] == 0].copy()
selected_train.sort_values('row_id', inplace=True)
selected_test.sort_values('row_id', inplace=True)
#remove target_column, group_column, 'row_id', 'train' columns
selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
selected_train, selected_test = np.array(selected_train), np.array(selected_test)
return selected_train, selected_test
def interaction_features(train, test, fea1, fea2, prefix):
train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]
test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]
feature_name = ['inter_{}*'.format(prefix), 'inter_{}/'.format(prefix), ]
return train, test, feature_name
================================================
FILE: code/fea_eng0.py
================================================
"""xgb prediction as features"""
import xgboost as xgb
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd
eta = 0.1
max_depth = 6
subsample = 0.9
colsample_bytree = 0.85
min_child_weight = 55
num_boost_round = 500
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
params = {"objective": "reg:linear",
"booster": "gbtree",
"eta": eta,
"max_depth": int(max_depth),
"subsample": subsample,
"colsample_bytree": colsample_bytree,
"min_child_weight": min_child_weight,
"silent": 1
}
data = train.append(test)
data.reset_index(inplace=True)
train_rows = train.shape[0]
feature_results = []
for target_g in ['car', 'ind', 'reg']:
features = [x for x in list(data) if target_g not in x]
target_list = [x for x in list(data) if target_g in x]
train_fea = np.array(data[features])
for target in target_list:
print(target)
train_label = data[target]
kfold = KFold(n_splits=5, random_state=218, shuffle=True)
kf = kfold.split(data)
cv_train = np.zeros(shape=(data.shape[0], 1))
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]
dtrain = xgb.DMatrix(X_train, label_train)
dvalid = xgb.DMatrix(X_validate, label_validate)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,
early_stopping_rounds=10)
cv_train[validate, 0] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
feature_results.append(cv_train)
feature_results = np.hstack(feature_results)
train_features = feature_results[:train_rows, :]
test_features = feature_results[train_rows:, :]
import pickle
pickle.dump([train_features, test_features], open("../input/fea0.pk", 'wb'))
================================================
FILE: code/gbm_model291.py
================================================
import lightgbm as lgbm
from scipy import sparse as ssp
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
def Gini(y_true, y_pred):
# check and get number of samples
assert y_true.shape == y_pred.shape
n_samples = y_true.shape[0]
# sort rows on prediction column
# (from largest to smallest)
arr = np.array([y_true, y_pred]).transpose()
true_order = arr[arr[:, 0].argsort()][::-1, 0]
pred_order = arr[arr[:, 1].argsort()][::-1, 0]
# get Lorenz curves
L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
L_ones = np.linspace(1 / n_samples, 1, n_samples)
# get Gini coefficients (area between curves)
G_true = np.sum(L_ones - L_true)
G_pred = np.sum(L_ones - L_pred)
# normalize to true Gini coefficient
return G_pred * 1. / G_true
cv_only = True
save_cv = True
full_train = False
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds), True
path = "../input/"
train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
y = train['target'].values
drop_feature = [
'id',
'target'
]
X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')
for c in cat_features:
le = LabelEncoder()
le.fit(train[c])
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])
enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])
ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
if count==0:
train['new_ind'] = train[c].astype(str)+'_'
test['new_ind'] = test[c].astype(str)+'_'
count+=1
else:
train['new_ind'] += train[c].astype(str)+'_'
test['new_ind'] += test[c].astype(str)+'_'
cat_count_features = []
for c in cat_features+['new_ind']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]
X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
feature_fraction = 0.6
num_boost_round = 10000
params = {"objective": "binary",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": num_leaves,
"max_bin": 256,
"feature_fraction": feature_fraction,
"verbosity": 0,
"drop_rate": 0.1,
"is_unbalance": False,
"max_drop": 50,
"min_child_samples": 10,
"min_child_weight": 150,
"min_split_gain": 0,
"subsample": 0.9
}
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(16):
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
params['seed'] = s
if cv_only:
kf = kfold.split(X, train_label)
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = lgbm.Dataset(X_train, label_train)
dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
early_stopping_rounds=100)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
cv_train[validate] += bst.predict(X_validate)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
cv_pred /= NFOLDS
final_cv_train += cv_train
final_cv_pred += cv_pred
print("cv score:")
print Gini(train_label, cv_train)
print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
print(fold_scores)
print(best_trees, np.mean(best_trees))
x_score.append(Gini(train_label, cv_train))
print(x_score)
pd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm3_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm3_cv_avg.csv', index=False)
================================================
FILE: code/nn_model290.py
================================================
from keras.layers import Dense, Dropout, Embedding, Flatten, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from time import time
import datetime
from keras.models import Model
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini, interaction_features
from itertools import combinations
from util import proj_num_on_cat
from scipy import sparse
from sklearn.preprocessing import StandardScaler
import pickle
from sklearn.preprocessing import LabelEncoder
cv_only = True
save_cv = True
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
train['new_ind'] = train[c].astype(str)
count += 1
else:
train['new_ind'] += '_' + train[c].astype(str)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
test['new_ind'] = test[c].astype(str)
count += 1
else:
test['new_ind'] += '_' + test[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
train['new_reg'] = train[c].astype(str)
count += 1
else:
train['new_reg'] += '_' + train[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
test['new_reg'] = test[c].astype(str)
count += 1
else:
test['new_reg'] += '_' + test[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
train['new_car'] = train[c].astype(str)
count += 1
else:
train['new_car'] += '_' + train[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
test['new_car'] = test[c].astype(str)
count += 1
else:
test['new_car'] += '_' + test[c].astype(str)
train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]
max_cat_values = []
for c in cat_fea:
le = LabelEncoder()
x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
train_cat[c] = le.transform(train_cat[c])
test_cat[c] = le.transform(test_cat[c])
max_cat_values.append(np.max(x))
# xgboost prediction
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0]
#feature aggregation
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
all_data = np.vstack([X.toarray(), X_test.toarray()])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
print(X.shape, X_test.shape)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
def nn_model():
inputs = []
flatten_layers = []
for e, c in enumerate(cat_fea):
input_c = Input(shape=(1, ), dtype='int32')
num_c = max_cat_values[e]
embed_c = Embedding(
num_c,
6,
input_length=1
)(input_c)
embed_c = Dropout(0.25)(embed_c)
flatten_c = Flatten()(embed_c)
inputs.append(input_c)
flatten_layers.append(flatten_c)
input_num = Input(shape=(X.shape[1],), dtype='float32')
flatten_layers.append(input_num)
inputs.append(input_num)
flatten = merge(flatten_layers, mode='concat')
fc1 = Dense(512, init='he_normal')(flatten)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.75)(fc1)
fc1 = Dense(64, init='he_normal')(fc1)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.5)(fc1)
outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)
model = Model(input = inputs, output = outputs)
model.compile(loss='binary_crossentropy', optimizer='adam')
return (model)
num_seeds = 5
begintime = time()
if cv_only:
for s in xrange(num_seeds):
np.random.seed(s)
for (inTr, inTe) in kfold.split(X, train_label):
xtr = X[inTr]
ytr = train_label[inTr]
xte = X[inTe]
yte = train_label[inTe]
xtr_cat = X_cat[inTr]
xte_cat = X_cat[inTe]
# get xtr xte cat
xtr_cat_list, xte_cat_list = [], []
for i in xrange(xtr_cat.shape[1]):
xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))
xtr_cat_list.append(xtr)
xte_cat_list.append(xte)
model = nn_model()
def get_rank(x):
return pd.Series(x).rank(pct=True).values
model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
print(Gini(train_label[inTe], cv_train[inTe]))
cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
print(s)
print(Gini(train_label, cv_train / (1. * (s + 1))))
print(str(datetime.timedelta(seconds=time() - begintime)))
if save_cv:
pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)
================================================
FILE: code/simple_average.py
================================================
'''
simple average of two models to get 2nd place
'''
import pandas as pd
keras5_test = pd.read_csv("../model/keras5_pred.csv")
lgbm3_test = pd.read_csv("../model/lgbm3_pred_avg.csv")
def get_rank(x):
return pd.Series(x).rank(pct=True).values
pd.DataFrame({'id': keras5_test['id'], 'target':
get_rank(keras5_test['target']) * 0.5 + get_rank(keras5_test['target']) * 0.5}).to_csv(
"../model/simple_average.csv", index = False)
================================================
FILE: code/util.py
================================================
import numpy as np
import pandas as pd
def Gini(y_true, y_pred):
# check and get number of samples
assert y_true.shape == y_pred.shape
n_samples = y_true.shape[0]
# sort rows on prediction column
# (from largest to smallest)
arr = np.array([y_true, y_pred]).transpose()
true_order = arr[arr[:, 0].argsort()][::-1, 0]
pred_order = arr[arr[:, 1].argsort()][::-1, 0]
# get Lorenz curves
L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
L_ones = np.linspace(1 / n_samples, 1, n_samples)
# get Gini coefficients (area between curves)
G_true = np.sum(L_ones - L_true)
G_pred = np.sum(L_ones - L_pred)
# normalize to true Gini coefficient
return G_pred * 1. / G_true
def cat_count(train_df, test_df, cat_list):
train_df['row_id'] = range(train_df.shape[0])
test_df['row_id'] = range(test_df.shape[0])
train_df['train'] = 1
test_df['train'] = 0
all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])
for e, cat in enumerate(cat_list):
grouped = all_df[[cat]].groupby(cat)
the_size = pd.DataFrame(grouped.size()).reset_index()
the_size.columns = [cat, '{}_size'.format(cat)]
all_df = pd.merge(all_df, the_size, how='left')
selected_train = all_df[all_df['train'] == 1]
selected_test = all_df[all_df['train'] == 0]
selected_train.sort_values('row_id', inplace=True)
selected_test.sort_values('row_id', inplace=True)
selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
selected_train, selected_test = np.array(selected_train), np.array(selected_test)
print(selected_train.shape, selected_test.shape)
return selected_train, selected_test
def proj_num_on_cat(train_df, test_df, target_column, group_column):
"""
:param train_df: train data frame
:param test_df: test data frame
:param target_column: name of numerical feature
:param group_column: name of categorical feature
"""
train_df['row_id'] = range(train_df.shape[0])
test_df['row_id'] = range(test_df.shape[0])
train_df['train'] = 1
test_df['train'] = 0
all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',
target_column, group_column]])
grouped = all_df[[target_column, group_column]].groupby(group_column)
the_size = pd.DataFrame(grouped.size()).reset_index()
the_size.columns = [group_column, '%s_size' % target_column]
the_mean = pd.DataFrame(grouped.mean()).reset_index()
the_mean.columns = [group_column, '%s_mean' % target_column]
the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
the_std.columns = [group_column, '%s_std' % target_column]
the_median = pd.DataFrame(grouped.median()).reset_index()
the_median.columns = [group_column, '%s_median' % target_column]
the_stats = pd.merge(the_size, the_mean)
the_stats = pd.merge(the_stats, the_std)
the_stats = pd.merge(the_stats, the_median)
the_max = pd.DataFrame(grouped.max()).reset_index()
the_max.columns = [group_column, '%s_max' % target_column]
the_min = pd.DataFrame(grouped.min()).reset_index()
the_min.columns = [group_column, '%s_min' % target_column]
the_stats = pd.merge(the_stats, the_max)
the_stats = pd.merge(the_stats, the_min)
all_df = pd.merge(all_df, the_stats, how='left')
selected_train = all_df[all_df['train'] == 1]
selected_test = all_df[all_df['train'] == 0]
selected_train.sort_values('row_id', inplace=True)
selected_test.sort_values('row_id', inplace=True)
selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
selected_train, selected_test = np.array(selected_train), np.array(selected_test)
print(selected_train.shape, selected_test.shape)
return selected_train, selected_test
def interaction_features(train, test, fea1, fea2, prefix):
train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]
test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]
return train, test
================================================
FILE: code_for_exact_solution/keras3.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle
cv_only = True
save_cv = True
full_train = True
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds)
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
cat_count_features = []
for c in cat_fea:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
def nn_model():
model = Sequential()
model.add(Dense(512, input_dim=X.shape[1], init='he_normal'))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dropout(0.9))
model.add(Dense(64, init='he_normal'))
model.add(PReLU())
model.add(BatchNormalization())
model.add(Dropout(0.8))
model.add(Dense(1, init='he_normal', activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
return (model)
num_seeds = 5
begintime = time()
if cv_only:
for s in xrange(num_seeds):
np.random.seed(s)
for (inTr, inTe) in kfold.split(X, train_label):
xtr = X[inTr]
ytr = train_label[inTr]
xte = X[inTe]
yte = train_label[inTe]
model = nn_model()
model.fit(xtr, ytr, epochs=35, batch_size=512, verbose=2, validation_data=[xte, yte])
cv_train[inTe] += model.predict_proba(x=xte, batch_size=512, verbose=0)[:, 0]
cv_pred += model.predict_proba(x=X_test, batch_size=512, verbose=0)[:, 0]
print(s)
print(Gini(train_label, cv_train / (1. * (s + 1))))
print(str(datetime.timedelta(seconds=time() - begintime)))
if save_cv:
pd.DataFrame({'id': test_id, 'target': cv_pred * 1./ (NFOLDS * num_seeds)}).to_csv('../model/keras3_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': cv_train * 1. / num_seeds}).to_csv('../model/keras3_cv.csv', index=False)
================================================
FILE: code_for_exact_solution/keras6.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
from keras.models import Model
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle
cv_only = True
save_cv = True
full_train = True
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds)
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
path = "../input/"
num_features_comb = []
for p in os.listdir(path):
if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
print(p)
x,xt = pd.read_pickle(path+p)
train[p] = x
test[p] = xt
num_features_comb.append(p)
num_features += num_features_comb
feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
train['new_ind'] = train[c].astype(str)
count += 1
else:
train['new_ind'] += '_' + train[c].astype(str)
print(train['new_ind'].nunique())
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
test['new_ind'] = test[c].astype(str)
count += 1
else:
test['new_ind'] += '_' + test[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
train['new_reg'] = train[c].astype(str)
count += 1
else:
train['new_reg'] += '_' + train[c].astype(str)
print(train['new_reg'].nunique())
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
test['new_reg'] = test[c].astype(str)
count += 1
else:
test['new_reg'] += '_' + test[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
train['new_car'] = train[c].astype(str)
count += 1
else:
train['new_car'] += '_' + train[c].astype(str)
print(train['new_car'].nunique())
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
test['new_car'] = test[c].astype(str)
count += 1
else:
test['new_car'] += '_' + test[c].astype(str)
new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
from sklearn.preprocessing import LabelEncoder
train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]
max_cat_values = []
for c in cat_fea:
le = LabelEncoder()
x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
train_cat[c] = le.transform(train_cat[c])
test_cat[c] = le.transform(test_cat[c])
max_cat_values.append(np.max(x))
#train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
def nn_model():
inputs = []
flatten_layers = []
for e, c in enumerate(cat_fea):
input_c = Input(shape=(1, ), dtype='int32')
num_c = max_cat_values[e]
embed_c = Embedding(
num_c,
64,
input_length=1
)(embed_c)
embed_c = Dropout(0.25)(embed_c)
flatten_c = Flatten()(embed_c)
inputs.append(input_c)
flatten_layers.append(flatten_c)
input_num = Input(shape=(X.shape[1],), dtype='float32')
flatten_layers.append(input_num)
inputs.append(input_num)
flatten = merge(flatten_layers, mode='concat')
fc1 = Dense(512, kernel_init='he_normal')(flatten)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.8)(fc1)
fc1 = Dense(64, kernel_init='he_normal')(fc1)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.8)(fc1)
outputs = Dense(1, kernel_init='he_normal', activation='sigmoid')(fc1)
model = Model(input = inputs, output = outputs)
model.compile(loss='binary_crossentropy', optimizer='adam')
return (model)
X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
num_seeds = 5
def nn_model():
inputs = []
flatten_layers = []
for e, c in enumerate(cat_fea):
input_c = Input(shape=(1, ), dtype='int32')
num_c = max_cat_values[e]
embed_c = Embedding(
num_c,
6,
input_length=1
)(input_c)
embed_c = Dropout(0.25)(embed_c)
flatten_c = Flatten()(embed_c)
inputs.append(input_c)
flatten_layers.append(flatten_c)
input_num = Input(shape=(X.shape[1],), dtype='float32')
flatten_layers.append(input_num)
inputs.append(input_num)
flatten = merge(flatten_layers, mode='concat')
fc1 = Dense(512, init='he_normal')(flatten)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.75)(fc1)
fc1 = Dense(64, init='he_normal')(fc1)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.5)(fc1)
outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)
model = Model(input = inputs, output = outputs)
model.compile(loss='binary_crossentropy', optimizer='adam')
return (model)
num_seeds = 5
begintime = time()
if cv_only:
for s in xrange(num_seeds):
np.random.seed(s)
for (inTr, inTe) in kfold.split(X, train_label):
xtr = X[inTr]
ytr = train_label[inTr]
xte = X[inTe]
yte = train_label[inTe]
xtr_cat = X_cat[inTr]
xte_cat = X_cat[inTe]
# get xtr xte cat
xtr_cat_list, xte_cat_list = [], []
for i in xrange(xtr_cat.shape[1]):
xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))
xtr_cat_list.append(xtr)
xte_cat_list.append(xte)
model = nn_model()
def get_rank(x):
return pd.Series(x).rank(pct=True).values
model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
print(Gini(train_label[inTe], cv_train[inTe]))
cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
print(s)
print(Gini(train_label, cv_train / (1. * (s + 1))))
print(str(datetime.timedelta(seconds=time() - begintime)))
if save_cv:
pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras6_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras6_cv.csv', index=False)
================================================
FILE: code_for_exact_solution/keras7.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
from keras.models import Model
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle
cv_only = True
save_cv = True
full_train = True
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds)
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
path = "../input/"
num_features_comb = []
for p in os.listdir(path):
if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
print(p)
x,xt = pd.read_pickle(path+p)
train[p] = x
test[p] = xt
num_features_comb.append(p)
num_features += num_features_comb
feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
train['new_ind'] = train[c].astype(str)
count += 1
else:
train['new_ind'] += '_' + train[c].astype(str)
print(train['new_ind'].nunique())
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
test['new_ind'] = test[c].astype(str)
count += 1
else:
test['new_ind'] += '_' + test[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
train['new_reg'] = train[c].astype(str)
count += 1
else:
train['new_reg'] += '_' + train[c].astype(str)
print(train['new_reg'].nunique())
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
test['new_reg'] = test[c].astype(str)
count += 1
else:
test['new_reg'] += '_' + test[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
train['new_car'] = train[c].astype(str)
count += 1
else:
train['new_car'] += '_' + train[c].astype(str)
print(train['new_car'].nunique())
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
test['new_car'] = test[c].astype(str)
count += 1
else:
test['new_car'] += '_' + test[c].astype(str)
new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
from sklearn.preprocessing import LabelEncoder
train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]
max_cat_values = []
for c in cat_fea:
le = LabelEncoder()
x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
train_cat[c] = le.transform(train_cat[c])
test_cat[c] = le.transform(test_cat[c])
max_cat_values.append(np.max(x))
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
num_seeds = 5
def nn_model():
inputs = []
flatten_layers = []
for e, c in enumerate(cat_fea):
input_c = Input(shape=(1, ), dtype='int32')
num_c = max_cat_values[e]
embed_c = Embedding(
num_c,
3,
input_length=1
)(input_c)
embed_c = Dropout(0.)(embed_c)
flatten_c = Flatten()(embed_c)
inputs.append(input_c)
flatten_layers.append(flatten_c)
input_num = Input(shape=(X.shape[1],), dtype='float32')
flatten_layers.append(input_num)
inputs.append(input_num)
flatten = merge(flatten_layers, mode='concat')
fc1 = Dense(512, init='he_normal')(flatten)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.8)(fc1)
fc1 = Dense(64, init='he_normal')(fc1)
fc1 = PReLU()(fc1)
fc1 = BatchNormalization()(fc1)
fc1 = Dropout(0.8)(fc1)
outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)
model = Model(input = inputs, output = outputs)
model.compile(loss='binary_crossentropy', optimizer='adam')
return (model)
num_seeds = 3
begintime = time()
if cv_only:
for s in xrange(num_seeds):
np.random.seed(s)
for (inTr, inTe) in kfold.split(X, train_label):
xtr = X[inTr]
ytr = train_label[inTr]
xte = X[inTe]
yte = train_label[inTe]
xtr_cat = X_cat[inTr]
xte_cat = X_cat[inTe]
# get xtr xte cat
xtr_cat_list, xte_cat_list = [], []
for i in xrange(xtr_cat.shape[1]):
xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))
xtr_cat_list.append(xtr)
xte_cat_list.append(xte)
model = nn_model()
def get_rank(x):
return pd.Series(x).rank(pct=True).values
model.fit(xtr_cat_list, ytr, epochs=30, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
print(Gini(train_label[inTe], cv_train[inTe]))
cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
print(s)
print(Gini(train_label, cv_train / (1. * (s + 1))))
print(str(datetime.timedelta(seconds=time() - begintime)))
pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred_0.csv', index=False)
pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv_0.csv', index=False)
if save_cv:
pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv.csv', index=False)
================================================
FILE: code_for_exact_solution/lightgbm1.py
================================================
'''
simple xgboost benchmark
'''
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
cv_only = True
save_cv = True
full_train = False
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds), True
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_list = [x for x in list(train) if 'cat' in x]
print(cat_list)
train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=False)
train_cat = train[[x for x in list(train) if 'cat' in x]].as_matrix()
train_num = train[[x for x in list(train) if 'cat' not in x]]
test_cat = test[[x for x in list(test) if 'cat' in x]].as_matrix()
test_num = test[[x for x in list(test) if 'cat' not in x]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
train_ohe = ohe.fit_transform(train_cat)
test_ohe = ohe.transform(test_cat)
print("cat_list now:", cat_list)
train_cat_count, test_cat_count = cat_count(train, test, cat_list)
print("cat count shape:", train_cat_count.shape, test_cat_count.shape)
X = sparse.hstack([train_num, train_ohe, train_cat_count]).tocsr()
X_test = sparse.hstack([test_num, test_ohe, test_cat_count]).tocsr()
print(X.shape, X_test.shape)
params = {"objective": "binary",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": int(num_leaves),
"max_bin": 256,
"min_data_in_leaf": min_data_in_leaf,
"feature_fraction": feature_fraction,
"verbosity": 0,
"seed": 218,
"drop_rate": 0.1,
"is_unbalance": False,
"max_drop": 50,
"min_child_samples": 10,
"min_child_weight": 150,
"min_split_gain": 0,
"subsample": 0.9
}
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(16):
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
params['seed'] = s
if cv_only:
kf = kfold.split(X, train_label)
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = lgbm.Dataset(X_train, label_train)
dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
early_stopping_rounds=100)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
cv_train[validate] += bst.predict(X_validate)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
cv_pred /= NFOLDS
final_cv_train += cv_train
final_cv_pred += cv_pred
print("cv score:")
print Gini(train_label, cv_train)
print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
print(fold_scores)
print(best_trees, np.mean(best_trees))
x_score.append(Gini(train_label, cv_train))
print(x_score)
pd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm1_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm1_cv_avg.csv', index=False)
================================================
FILE: code_for_exact_solution/lightgbm5.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
cv_only = True
save_cv = True
full_train = False
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from util import Gini, proj_num_on_cat, interaction_features, cat_count
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds), True
path = "../input/"
train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
y = train['target'].values
drop_feature = [
'id',
'target'
]
X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
for c in cat_features:
le = LabelEncoder()
le.fit(train[c])
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])
ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
if count==0:
train['new_ind'] = train[c].astype(str)+'_'
test['new_ind'] = test[c].astype(str)+'_'
count+=1
else:
train['new_ind'] += train[c].astype(str)+'_'
test['new_ind'] += test[c].astype(str)+'_'
cat_count_features = []
for c in cat_features+['new_ind']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
score = spearmanr(train['target'],train['%s_count'%c])
print(c,score)
train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]
# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
# train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
# test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)
X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()
params = {"objective": "poisson",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": int(num_leaves),
"max_bin": 256,
"feature_fraction": feature_fraction,
"verbosity": 0,
"drop_rate": 0.1,
"is_unbalance": False,
"max_drop": 50,
"min_child_samples": 10,
"min_child_weight": 150,
"min_split_gain": 0,
"subsample": 0.9
}
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(10):
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
params['seed'] = s
if cv_only:
kf = kfold.split(X, train_label)
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = lgbm.Dataset(X_train, label_train)
dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
early_stopping_rounds=100)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
cv_train[validate] += bst.predict(X_validate)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
cv_pred /= NFOLDS
final_cv_train += cv_train
final_cv_pred += cv_pred
print("cv score:")
print Gini(train_label, cv_train)
print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
print(fold_scores)
print(best_trees, np.mean(best_trees))
x_score.append(Gini(train_label, cv_train))
print(x_score)
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm5_pred_avg.csv', index=False)
cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm5_cv_avg.csv', index=False)
#cv score:
#0.287007087138
#current score: 0.289683837899 16
================================================
FILE: code_for_exact_solution/lightgbm6.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
cv_only = True
save_cv = True
full_train = False
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from util import Gini, proj_num_on_cat, interaction_features, cat_count
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds), True
path = "../input/"
train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
y = train['target'].values
drop_feature = [
'id',
'target'
]
X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
for c in cat_features:
le = LabelEncoder()
le.fit(train[c])
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])
ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
if count==0:
train['new_ind'] = train[c].astype(str)+'_'
test['new_ind'] = test[c].astype(str)+'_'
count+=1
else:
train['new_ind'] += train[c].astype(str)+'_'
test['new_ind'] += test[c].astype(str)+'_'
cat_count_features = []
for c in cat_features+['new_ind']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
score = spearmanr(train['target'],train['%s_count'%c])
print(c,score)
train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]
# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
# train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
# test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)
X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()
params = {"objective": "fair",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": int(num_leaves),
"max_bin": 256,
"feature_fraction": feature_fraction,
"verbosity": 0,
"drop_rate": 0.1,
"is_unbalance": False,
"max_drop": 50,
"min_child_samples": 10,
"min_child_weight": 150,
"min_split_gain": 0,
"subsample": 0.9
}
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(10):
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
params['seed'] = s
if cv_only:
kf = kfold.split(X, train_label)
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = lgbm.Dataset(X_train, label_train)
dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
early_stopping_rounds=100)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
cv_train[validate] += bst.predict(X_validate)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
cv_pred /= NFOLDS
final_cv_train += cv_train
final_cv_pred += cv_pred
print("cv score:")
print Gini(train_label, cv_train)
print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
print(fold_scores)
print(best_trees, np.mean(best_trees))
x_score.append(Gini(train_label, cv_train))
print(x_score)
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm6_pred_avg.csv', index=False)
cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm6_cv_avg.csv', index=False)
#cv score:
#0.287007087138
#current score: 0.289683837899 16
================================================
FILE: code_for_exact_solution/lightgbm7.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
cv_only = True
save_cv = True
full_train = False
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from util import Gini, proj_num_on_cat, interaction_features, cat_count
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds), True
path = "../input/"
train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
y = train['target'].values
drop_feature = [
'id',
'target'
]
X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
for c in cat_features:
le = LabelEncoder()
le.fit(train[c])
train[c] = le.transform(train[c])
test[c] = le.transform(test[c])
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])
ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
if count==0:
train['new_ind'] = train[c].astype(str)+'_'
test['new_ind'] = test[c].astype(str)+'_'
count+=1
else:
train['new_ind'] += train[c].astype(str)+'_'
test['new_ind'] += test[c].astype(str)+'_'
cat_count_features = []
for c in cat_features+['new_ind']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
score = spearmanr(train['target'],train['%s_count'%c])
print(c,score)
train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]
# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
# train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
# test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)
X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()
params = {"objective": "binary",
"boosting_type": "goss",
"learning_rate": learning_rate,
"num_leaves": int(num_leaves),
"max_bin": 256,
"feature_fraction": feature_fraction,
"verbosity": 0,
"drop_rate": 0.1,
"is_unbalance": False,
"max_drop": 50,
"min_child_samples": 10,
"min_child_weight": 150,
"min_split_gain": 0,
"subsample": 0.9
}
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(10):
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
params['seed'] = s
if cv_only:
kf = kfold.split(X, train_label)
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = lgbm.Dataset(X_train, label_train)
dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
early_stopping_rounds=100)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
cv_train[validate] += bst.predict(X_validate)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
cv_pred /= NFOLDS
final_cv_train += cv_train
final_cv_pred += cv_pred
print("cv score:")
print Gini(train_label, cv_train)
print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
print(fold_scores)
print(best_trees, np.mean(best_trees))
x_score.append(Gini(train_label, cv_train))
print(x_score)
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm7_pred_avg.csv', index=False)
cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm7_cv_avg.csv', index=False)
#cv score:
#0.287007087138
#current score: 0.289683837899 16
================================================
FILE: code_for_exact_solution/lightgbm8.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
cv_only = True
save_cv = True
full_train = False
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from util import Gini, proj_num_on_cat, interaction_features, cat_count
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
cv_only = True
save_cv = True
full_train = True
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds), True
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
path = "../input/"
num_features_comb = []
import os
for p in os.listdir(path):
if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
print(p)
x,xt = pd.read_pickle(path+p)
train[p] = x
test[p] = xt
num_features_comb.append(p)
num_features += num_features_comb
feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
train['new_ind'] = train[c].astype(str)
count += 1
else:
train['new_ind'] += '_' + train[c].astype(str)
print(train['new_ind'].nunique())
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
if count == 0:
test['new_ind'] = test[c].astype(str)
count += 1
else:
test['new_ind'] += '_' + test[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
train['new_reg'] = train[c].astype(str)
count += 1
else:
train['new_reg'] += '_' + train[c].astype(str)
print(train['new_reg'].nunique())
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
if count == 0:
test['new_reg'] = test[c].astype(str)
count += 1
else:
test['new_reg'] += '_' + test[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
train['new_car'] = train[c].astype(str)
count += 1
else:
train['new_car'] += '_' + train[c].astype(str)
print(train['new_car'].nunique())
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
if count == 0:
test['new_car'] = test[c].astype(str)
count += 1
else:
test['new_car'] += '_' + test[c].astype(str)
new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest
cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
params = {"objective": "binary",
"boosting_type": "gbdt",
"learning_rate": learning_rate,
"num_leaves": int(num_leaves),
"max_bin": 256,
"feature_fraction": feature_fraction,
"verbosity": 0,
"drop_rate": 0.1,
"is_unbalance": False,
"max_drop": 50,
"min_child_samples": 10,
"min_child_weight": 150,
"min_split_gain": 0,
"subsample": 0.9
}
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(8):
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
params['seed'] = s
if cv_only:
kf = kfold.split(X, train_label)
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = lgbm.Dataset(X_train, label_train)
dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
early_stopping_rounds=100)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
cv_train[validate] += bst.predict(X_validate)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
cv_pred /= NFOLDS
final_cv_train += cv_train
final_cv_pred += cv_pred
print("cv score:")
print Gini(train_label, cv_train)
print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
print(fold_scores)
print(best_trees, np.mean(best_trees))
x_score.append(Gini(train_label, cv_train))
print(x_score)
pd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm8_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm8_cv_avg.csv', index=False)
#cv score:
#0.287007087138
#current score: 0.289683837899 16
================================================
FILE: code_for_exact_solution/logistic1.py
================================================
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle
cv_only = True
save_cv = True
full_train = True
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds)
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train_cat_count, test_cat_count = cat_count(train, test, cat_fea)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x not in cat_fea]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x not in cat_fea]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train_fea0, train_cat_count]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test_fea0, test_cat_count]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
#all_data = np.vstack([X, X_test])
selector = SelectPercentile(f_classif, 75)
selector.fit(X.toarray(), train_label)
X, X_test = selector.transform(X.toarray()), selector.transform(X_test.toarray())
all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = csr_matrix(scaler.transform(X))
X_test = csr_matrix(scaler.transform(X_test))
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
kf = kfold.split(X, train_label)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
best_trees = []
fold_scores = []
if cv_only:
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
#selector = SelectPercentile(f_classif, X_train.toarray(), label_train)
#X_train, X_validate = csr_matrix(selector.transform(X_train.toarray())), csr_matrix(selector.transform(
# X_validate.toarray()
#))
clf = LR(C=25.)
clf.fit(X_train, label_train)
cv_pred += clf.predict_proba(X_test)[:, 1]
cv_train[validate] += clf.predict_proba(X_validate)[:, 1]
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
print("cv score:")
print Gini(train_label, cv_train)
print(fold_scores)
if save_cv:
pd.DataFrame({'id': test_id, 'target': cv_pred/NFOLDS}).to_csv('../model/logistic1_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': cv_train}).to_csv('../model/logistic1_cv.csv', index=False)
#cv score:
#0.27906438918
#[0.28773918373071583, 0.26723600995806762, 0.28158062789785737, 0.27506435916773242, 0.28394519465855006]
if full_train:
clf = LR(C=25.)
clf.fit(X, train_label)
pred = clf.predict_proba(X_test)[:, 1]
pd.DataFrame({'id': test_id, 'target': pred}).to_csv('../model/logistic1_full_pred.csv', index=False)
================================================
FILE: code_for_exact_solution/rank_average.py
================================================
import pandas as pd
from util import Gini
def get_rank(x):
return pd.Series(x).rank(pct=True).values
train = pd.read_csv("../input/train.csv", usecols = ['target'])
keras3_train = pd.read_csv("../model/keras3_cv.csv")
keras5_train = pd.read_csv("../model/keras5_cv.csv")
keras6_train = pd.read_csv("../model/keras6_cv.csv")
keras7_train = pd.read_csv("../model/keras7_cv.csv")
lgbm1_train = pd.read_csv("../model/lgbm1_cv_avg.csv")
lgbm3_train = pd.read_csv("../model/lgbm3_cv_avg.csv")
lgbm8_train = pd.read_csv("../model/lgbm8_cv_avg.csv")
lgbm5_train = pd.read_csv("../model/lgbm5_cv_avg.csv")
lgbm6_train = pd.read_csv("../model/lgbm6_cv_avg.csv")
lgbm7_train = pd.read_csv("../model/lgbm7_cv_avg.csv")
logistic1_train = pd.read_csv("../model/logistic1_cv.csv")
xgb0_train = pd.read_csv("../model/xgb0_cv.csv")
keras3_test = pd.read_csv("../model/keras3_pred.csv")
keras5_test = pd.read_csv("../model/keras5_pred.csv")
keras6_test = pd.read_csv("../model/keras6_pred.csv")
keras7_test = pd.read_csv("../model/keras7_pred.csv")
lgbm1_test = pd.read_csv("../model/lgbm1_pred_avg.csv")
lgbm3_test = pd.read_csv("../model/lgbm3_pred_avg.csv")
lgbm8_test = pd.read_csv("../model/lgbm8_pred_avg.csv")
lgbm5_test = pd.read_csv("../model/lgbm5_pred_avg.csv")
lgbm6_test = pd.read_csv("../model/lgbm6_pred_avg.csv")
lgbm7_test = pd.read_csv("../model/lgbm7_pred_avg.csv")
logistic1_test = pd.read_csv("../model/logistic1_pred.csv")
xgb0_test = pd.read_csv("../model/xgb0_pred.csv")
xgblinear_train = pd.read_csv("../model/xgb0l_cv.csv")
xgblinear_test = pd.read_csv("../model/xgb0l_pred.csv")
result = get_rank(keras5_train['target']) * 0.4 + get_rank(lgbm3_train['target']) * 0.5 + \
get_rank(xgb0_train['target']) * 0.1 + get_rank(lgbm1_train['target']) * (-0.1) + \
get_rank(keras3_train['target']) * 0.1 + get_rank(logistic1_train['target']) * 0.1 + \
get_rank(xgblinear_train['target']) * 0.1 + get_rank(lgbm8_train['target']) * 0.25 + \
get_rank(lgbm5_train['target']) * 0.1 + \
get_rank(lgbm6_train['target']) * (-0.1) + get_rank(lgbm7_train['target']) * (0.1) + \
get_rank(keras6_train['target']) * (-0.1) + \
get_rank(keras7_train['target']) * 0.3
print "cv of final averaged model:", Gini(train['target'], result)
result = get_rank(keras5_test['target']) * 0.4 + get_rank(lgbm3_test['target']) * 0.5 + \
get_rank(xgb0_test['target']) * 0.1 + get_rank(lgbm1_test['target']) * (-0.1) + \
get_rank(keras3_test['target']) * 0.1 + get_rank(logistic1_test['target']) * 0.1 + \
get_rank(xgblinear_test['target']) * 0.1 + get_rank(lgbm8_test['target']) * 0.25 + \
get_rank(lgbm5_test['target']) * 0.1 + \
get_rank(lgbm6_test['target']) * (-0.1) + get_rank(lgbm7_test['target']) * (0.1) + \
get_rank(keras6_test['target']) * (-0.1) + \
get_rank(keras7_test['target']) * 0.3
pd.DataFrame({'id': keras5_test['id'], 'target': get_rank(result)}).to_csv("../model/all_average.csv", index = False)
================================================
FILE: code_for_exact_solution/util.py
================================================
import numpy as np
import pandas as pd
def Gini(y_true, y_pred):
# check and get number of samples
assert y_true.shape == y_pred.shape
n_samples = y_true.shape[0]
# sort rows on prediction column
# (from largest to smallest)
arr = np.array([y_true, y_pred]).transpose()
true_order = arr[arr[:, 0].argsort()][::-1, 0]
pred_order = arr[arr[:, 1].argsort()][::-1, 0]
# get Lorenz curves
L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
L_ones = np.linspace(1 / n_samples, 1, n_samples)
# get Gini coefficients (area between curves)
G_true = np.sum(L_ones - L_true)
G_pred = np.sum(L_ones - L_pred)
# normalize to true Gini coefficient
return G_pred * 1. / G_true
def cat_count(train_df, test_df, cat_list):
train_df['row_id'] = range(train_df.shape[0])
test_df['row_id'] = range(test_df.shape[0])
train_df['train'] = 1
test_df['train'] = 0
all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])
for e, cat in enumerate(cat_list):
grouped = all_df[[cat]].groupby(cat)
the_size = pd.DataFrame(grouped.size()).reset_index()
the_size.columns = [cat, '{}_size'.format(cat)]
all_df = pd.merge(all_df, the_size, how='left')
selected_train = all_df[all_df['train'] == 1]
selected_test = all_df[all_df['train'] == 0]
selected_train.sort_values('row_id', inplace=True)
selected_test.sort_values('row_id', inplace=True)
selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
selected_train, selected_test = np.array(selected_train), np.array(selected_test)
print(selected_train.shape, selected_test.shape)
return selected_train, selected_test
def proj_num_on_cat(train_df, test_df, target_column, group_column):
"""
:param train_df: train data frame
:param test_df: test data frame
:param target_column: name of numerical feature
:param group_column: name of categorical feature
"""
train_df['row_id'] = range(train_df.shape[0])
test_df['row_id'] = range(test_df.shape[0])
train_df['train'] = 1
test_df['train'] = 0
all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train',
target_column, group_column]])
grouped = all_df[[target_column, group_column]].groupby(group_column)
the_size = pd.DataFrame(grouped.size()).reset_index()
the_size.columns = [group_column, '%s_size' % target_column]
the_mean = pd.DataFrame(grouped.mean()).reset_index()
the_mean.columns = [group_column, '%s_mean' % target_column]
the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
the_std.columns = [group_column, '%s_std' % target_column]
the_median = pd.DataFrame(grouped.median()).reset_index()
the_median.columns = [group_column, '%s_median' % target_column]
the_stats = pd.merge(the_size, the_mean)
the_stats = pd.merge(the_stats, the_std)
the_stats = pd.merge(the_stats, the_median)
the_max = pd.DataFrame(grouped.max()).reset_index()
the_max.columns = [group_column, '%s_max' % target_column]
the_min = pd.DataFrame(grouped.min()).reset_index()
the_min.columns = [group_column, '%s_min' % target_column]
the_stats = pd.merge(the_stats, the_max)
the_stats = pd.merge(the_stats, the_min)
all_df = pd.merge(all_df, the_stats, how='left')
selected_train = all_df[all_df['train'] == 1]
selected_test = all_df[all_df['train'] == 0]
selected_train.sort_values('row_id', inplace=True)
selected_test.sort_values('row_id', inplace=True)
selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
selected_train, selected_test = np.array(selected_train), np.array(selected_test)
print(selected_train.shape, selected_test.shape)
return selected_train, selected_test
def interaction_features(train, test, fea1, fea2, prefix):
train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]
test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]
return train, test
================================================
FILE: code_for_exact_solution/xgb0.py
================================================
'''
simple xgboost benchmark
'''
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
cv_only = True
save_cv = True
full_train = False
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds)
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
#kfold = KFold(n_splits=NFOLDS, shuffle=True, random_state=0)
eta = 0.05
max_depth = 7
subsample = 0.97
colsample_bytree = 0.85
gamma = 0.05
alpha = 0
min_child_weight = 55
#lamb = 0.35
colsample_bylevel = 0.8
num_boost_round = 10000
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)
cat_fea = [x for x in list(train) if 'cat' in x]
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x not in cat_fea]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x not in cat_fea]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest
train_list = [train_num, train_ohe]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num, test_ohe]#, np.ones(shape=(test_num.shape[0], 1))]
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X, X_test = X.toarray(), X_test.toarray()
print(X.shape, X_test.shape)
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
final_best_trees = []
params = {"objective": "binary:logistic",
"booster": "gbtree",
"eta": eta,
"max_depth": int(max_depth),
"subsample": subsample,
"colsample_bytree": colsample_bytree,
"gamma": gamma,
#"lamb": lamb,
"alpha": alpha,
"min_child_weight": min_child_weight,
"colsample_bylevel": colsample_bylevel,
"silent": 1
}
if cv_only:
num_seeds = 24
for s in xrange(num_seeds):
print(s)
params['seed'] = s
kf = kfold.split(X, train_label)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = xgb.DMatrix(X_train, label_train)
dvalid = xgb.DMatrix(X_validate, label_validate)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=800,
early_stopping_rounds=25, maximize=True)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(xgb.DMatrix(X_test))
cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
final_cv_train += cv_train
final_cv_pred += cv_pred
final_best_trees += best_trees
print("cv score:")
print Gini(train_label, cv_train)
print(fold_scores)
print(best_trees, np.mean(best_trees))
print("current score:", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)
final_cv_pred /= (NFOLDS * num_seeds)
final_cv_train /= num_seeds
pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb_avg16_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb_avg16_cv.csv', index=False)
print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))
## 0.1
#0.281739276885
#[0.28693135981084533, 0.26989064676756958, 0.28035898856108521, 0.28178381987103512, 0.29021910168396381]
#([123, 91, 139, 97, 92], 108.40000000000001)
#0.284350552387
#([1057, 933, 1175, 979, 1168], 1062.4000000000001)
if full_train:
for s in xrange(32):
params['seed'] = s
dtrain = xgb.DMatrix(X, train_label)
watchlist = [(dtrain, 'train')]
bst = xgb.train(params, dtrain, 100, evals=watchlist, feval=evalerror, verbose_eval=50, maximize=True)
pred = bst.predict(xgb.DMatrix(X_test))
if s == 0:
final_pred = pred
else:
final_pred += pred
pd.DataFrame({'id': test_id, 'target': final_pred / 32.}).to_csv('../model/xgb_avg_full_pred.csv', index=False)
================================================
FILE: code_for_exact_solution/xgb_linear0.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle
cv_only = True
save_cv = True
full_train = True
def evalerror(preds, dtrain):
labels = dtrain.get_label()
return 'gini', Gini(labels, preds)
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=True)
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
train_fea1, test_fea1 = pickle.load(open("../input/fea0_lgb.pk"))
cat_count_features = []
for c in cat_fea:
d = pd.concat([train[c],test[c]]).value_counts().to_dict()
train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
cat_count_features.append('%s_count'%c)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]
t_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']
g_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat'] + cat_fea
t_fea = list(set(t_fea))
g_fea = list(set(g_fea))
#proj
for t in t_fea:
for g in g_fea:
if t != g:
s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
train_list.append(s_train)
test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
final_best_trees = []
eta = 0.1
lamb = 0.25
alpha = 1
num_boost_round = 10000
params = {"objective": "binary:logistic",
"booster": "gbtree",
"eta": eta,
"lamb": lamb,
"alpha": alpha,
"silent": 1
}
if cv_only:
num_seeds = 3
for s in xrange(num_seeds):
print(s)
params['seed'] = s
kf = kfold.split(X, train_label)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
best_trees = []
fold_scores = []
for i, (train_fold, validate) in enumerate(kf):
X_train, X_validate, label_train, label_validate = \
X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
dtrain = xgb.DMatrix(X_train, label_train)
dvalid = xgb.DMatrix(X_validate, label_validate)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=100,
early_stopping_rounds=50, maximize=True)
best_trees.append(bst.best_iteration)
cv_pred += bst.predict(xgb.DMatrix(X_test))
cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
score = Gini(label_validate, cv_train[validate])
print score
fold_scores.append(score)
final_cv_train += cv_train
final_cv_pred += cv_pred
final_best_trees += best_trees
print("cv score:")
print Gini(train_label, cv_train)
print(fold_scores)
print(best_trees, np.mean(best_trees))
print("current score:", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)
final_cv_pred /= (NFOLDS * num_seeds)
final_cv_train /= num_seeds
pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb0l_pred.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb0l_cv.csv', index=False)
print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))
================================================
FILE: input/readme
================================================
unzipped data goes here
================================================
FILE: model/readme
================================================
models will be saved here
================================================
FILE: readme.md
================================================
### Requirements
*older or newer version of below packages should theoretically work fine*
python 2.7
numpy 1.13.3
pandas 0.20.3
sklearn 0.19.1
keras 2.1.1
tensorflow 1.4.0
xgboost 0.6
lightgbm 2.0.10
### How to reproduce
#### Simple solution (recommended)
Put unzipped data in `input`
*Generate a simple solution that is good enough for 2nd place (~0.2938 on private LB)*
`cd code`
`python fea_eng0.py`
`python nn_model290.py` to get a nn model that scores 0.290X
`python gbm_model291.py` to get a gbm model that scores 0.291X
`python simple_average.py` and then you can find the submission file at `../model/simple_average.csv`
You can reproduce this solution in a few hours.
#### Exact solution (Optional)
Although not recommended but you can also reproduce the exact same solution we submitted (0.29413 on private LB).
*you can follow these steps below, in addition to the simple solution*
```
cd ../code_for_exact_solution/
python keras3.py
python keras6.py
python keras7.py
python lightgbm1.py
python lightgbm5.py
python lightgbm6.py
python lightgbm7.py
python lightgbm8.py
python logistic1.py
python xgb0.py
python xgb_linear0.py
python rank_average.py
```
*It can take up to 2 days to generate the exact solution which only has 0.0003 improvement over the simple one*
gitextract_dcpwav_6/ ├── Jupyter_nnmodel/ │ ├── README.md │ ├── fea_eng0.py │ ├── feature_generater.py │ ├── nn_model .ipynb │ └── util.py ├── code/ │ ├── fea_eng0.py │ ├── gbm_model291.py │ ├── nn_model290.py │ ├── simple_average.py │ └── util.py ├── code_for_exact_solution/ │ ├── keras3.py │ ├── keras6.py │ ├── keras7.py │ ├── lightgbm1.py │ ├── lightgbm5.py │ ├── lightgbm6.py │ ├── lightgbm7.py │ ├── lightgbm8.py │ ├── logistic1.py │ ├── rank_average.py │ ├── util.py │ ├── xgb0.py │ └── xgb_linear0.py ├── input/ │ └── readme ├── model/ │ └── readme └── readme.md
SYMBOL INDEX (40 symbols across 19 files) FILE: Jupyter_nnmodel/feature_generater.py function Multiply_Divide (line 6) | def Multiply_Divide(train, test, features): function Series_string (line 24) | def Series_string(train, test, category_list): function Features_Counts (line 61) | def Features_Counts(train, test, features): function Statistic_features (line 73) | def Statistic_features(train, test, target_features, group_features): function features_type (line 84) | def features_type(train): FILE: Jupyter_nnmodel/util.py function Gini (line 4) | def Gini(y_true, y_pred): function cat_count (line 28) | def cat_count(train_df, test_df, cat_list): function proj_num_on_cat (line 52) | def proj_num_on_cat(train_df, test_df, target_column, group_column): function interaction_features (line 120) | def interaction_features(train, test, fea1, fea2, prefix): FILE: code/gbm_model291.py function Gini (line 9) | def Gini(y_true, y_pred): function evalerror (line 36) | def evalerror(preds, dtrain): FILE: code/nn_model290.py function nn_model (line 159) | def nn_model(): function get_rank (line 222) | def get_rank(x): FILE: code/simple_average.py function get_rank (line 8) | def get_rank(x): FILE: code/util.py function Gini (line 4) | def Gini(y_true, y_pred): function cat_count (line 28) | def cat_count(train_df, test_df, cat_list): function proj_num_on_cat (line 52) | def proj_num_on_cat(train_df, test_df, target_column, group_column): function interaction_features (line 100) | def interaction_features(train, test, fea1, fea2, prefix): FILE: code_for_exact_solution/keras3.py function evalerror (line 44) | def evalerror(preds, dtrain): function nn_model (line 141) | def nn_model(): FILE: code_for_exact_solution/keras6.py function evalerror (line 44) | def evalerror(preds, dtrain): function nn_model (line 213) | def nn_model(): function nn_model (line 262) | def nn_model(): function get_rank (line 325) | def get_rank(x): FILE: code_for_exact_solution/keras7.py function evalerror (line 44) | def evalerror(preds, dtrain): function nn_model (line 230) | def nn_model(): function get_rank (line 293) | def get_rank(x): FILE: code_for_exact_solution/lightgbm1.py function evalerror (line 19) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/lightgbm5.py function evalerror (line 50) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/lightgbm6.py function evalerror (line 50) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/lightgbm7.py function evalerror (line 50) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/lightgbm8.py function evalerror (line 62) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/logistic1.py function evalerror (line 23) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/rank_average.py function get_rank (line 4) | def get_rank(x): FILE: code_for_exact_solution/util.py function Gini (line 4) | def Gini(y_true, y_pred): function cat_count (line 28) | def cat_count(train_df, test_df, cat_list): function proj_num_on_cat (line 52) | def proj_num_on_cat(train_df, test_df, target_column, group_column): function interaction_features (line 100) | def interaction_features(train, test, fea1, fea2, prefix): FILE: code_for_exact_solution/xgb0.py function evalerror (line 20) | def evalerror(preds, dtrain): FILE: code_for_exact_solution/xgb_linear0.py function evalerror (line 44) | def evalerror(preds, dtrain):
Condensed preview — 26 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (172K chars).
[
{
"path": "Jupyter_nnmodel/README.md",
"chars": 3295,
"preview": "# Content: Jupyter Version of 2nd place code kaggle-porto-seguro\n## [Porto Seguro’s Safe Driver Prediction](https://www."
},
{
"path": "Jupyter_nnmodel/fea_eng0.py",
"chars": 2370,
"preview": "\"\"\"xgb prediction as features\"\"\"\nimport xgboost as xgb\nfrom sklearn.model_selection import KFold\nimport numpy as np\nimpo"
},
{
"path": "Jupyter_nnmodel/feature_generater.py",
"chars": 3701,
"preview": "from util import proj_num_on_cat, Gini, interaction_features\nfrom itertools import combinations\nimport numpy as np\nimpor"
},
{
"path": "Jupyter_nnmodel/nn_model .ipynb",
"chars": 39564,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {},\n \"outputs\": [\n {\n \"name\":"
},
{
"path": "Jupyter_nnmodel/util.py",
"chars": 5243,
"preview": "import numpy as np\nimport pandas as pd\n\ndef Gini(y_true, y_pred):\n # check and get number of samples\n assert y_tru"
},
{
"path": "code/fea_eng0.py",
"chars": 2239,
"preview": "\"\"\"xgb prediction as features\"\"\"\nimport xgboost as xgb\nfrom sklearn.model_selection import KFold\nimport numpy as np\nimpo"
},
{
"path": "code/gbm_model291.py",
"chars": 5342,
"preview": "import lightgbm as lgbm\nfrom scipy import sparse as ssp\nfrom sklearn.model_selection import StratifiedKFold\nimport numpy"
},
{
"path": "code/nn_model290.py",
"chars": 7688,
"preview": "from keras.layers import Dense, Dropout, Embedding, Flatten, Input, merge\nfrom keras.layers.normalization import BatchNo"
},
{
"path": "code/simple_average.py",
"chars": 439,
"preview": "'''\nsimple average of two models to get 2nd place\n'''\nimport pandas as pd\nkeras5_test = pd.read_csv(\"../model/keras5_pre"
},
{
"path": "code/util.py",
"chars": 4593,
"preview": "import numpy as np\nimport pandas as pd\n\ndef Gini(y_true, y_pred):\n # check and get number of samples\n assert y_tru"
},
{
"path": "code_for_exact_solution/keras3.py",
"chars": 6283,
"preview": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as x"
},
{
"path": "code_for_exact_solution/keras6.py",
"chars": 10763,
"preview": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as x"
},
{
"path": "code_for_exact_solution/keras7.py",
"chars": 10146,
"preview": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as x"
},
{
"path": "code_for_exact_solution/lightgbm1.py",
"chars": 4446,
"preview": "'''\nsimple xgboost benchmark\n'''\nimport lightgbm as lgbm\nfrom sklearn.model_selection import StratifiedKFold\nimport nump"
},
{
"path": "code_for_exact_solution/lightgbm5.py",
"chars": 6505,
"preview": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing imp"
},
{
"path": "code_for_exact_solution/lightgbm6.py",
"chars": 6502,
"preview": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing imp"
},
{
"path": "code_for_exact_solution/lightgbm7.py",
"chars": 6504,
"preview": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing imp"
},
{
"path": "code_for_exact_solution/lightgbm8.py",
"chars": 9058,
"preview": "import numpy as np\nimport scipy as sp\nfrom scipy import sparse as ssp\nimport pandas as pd\nfrom sklearn.preprocessing imp"
},
{
"path": "code_for_exact_solution/logistic1.py",
"chars": 5561,
"preview": "'''\nsimple xgboost benchmark\n'''\nfrom sklearn.model_selection import StratifiedKFold, KFold\nfrom sklearn.feature_selecti"
},
{
"path": "code_for_exact_solution/rank_average.py",
"chars": 3001,
"preview": "import pandas as pd\nfrom util import Gini\n\ndef get_rank(x):\n return pd.Series(x).rank(pct=True).values\n\ntrain = pd.re"
},
{
"path": "code_for_exact_solution/util.py",
"chars": 4593,
"preview": "import numpy as np\nimport pandas as pd\n\ndef Gini(y_true, y_pred):\n # check and get number of samples\n assert y_tru"
},
{
"path": "code_for_exact_solution/xgb0.py",
"chars": 5950,
"preview": "'''\nsimple xgboost benchmark\n'''\nimport xgboost as xgb\nfrom sklearn.model_selection import StratifiedKFold, KFold\nimport"
},
{
"path": "code_for_exact_solution/xgb_linear0.py",
"chars": 7121,
"preview": "import os\nimport sys\nimport operator\nimport numpy as np\nimport pandas as pd\nfrom scipy import sparse\nimport xgboost as x"
},
{
"path": "input/readme",
"chars": 23,
"preview": "unzipped data goes here"
},
{
"path": "model/readme",
"chars": 25,
"preview": "models will be saved here"
},
{
"path": "readme.md",
"chars": 1299,
"preview": "### Requirements\n\n*older or newer version of below packages should theoretically work fine*\n\npython 2.7\n\nnumpy 1.13.3\n\np"
}
]
About this extraction
This page contains the full source code of the xiaozhouwang/kaggle-porto-seguro GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 26 files (158.5 KB), approximately 48.6k tokens, and a symbol index with 40 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.