Repository: xiaozhouwang/kaggle-porto-seguro
Branch: master
Commit: 83d794f6dce6
Files: 26
Total size: 158.5 KB

Directory structure:
gitextract_dcpwav_6/
├── Jupyter_nnmodel/
│   ├── README.md
│   ├── fea_eng0.py
│   ├── feature_generater.py
│   ├── nn_model .ipynb
│   └── util.py
├── code/
│   ├── fea_eng0.py
│   ├── gbm_model291.py
│   ├── nn_model290.py
│   ├── simple_average.py
│   └── util.py
├── code_for_exact_solution/
│   ├── keras3.py
│   ├── keras6.py
│   ├── keras7.py
│   ├── lightgbm1.py
│   ├── lightgbm5.py
│   ├── lightgbm6.py
│   ├── lightgbm7.py
│   ├── lightgbm8.py
│   ├── logistic1.py
│   ├── rank_average.py
│   ├── util.py
│   ├── xgb0.py
│   └── xgb_linear0.py
├── input/
│   └── readme
├── model/
│   └── readme
└── readme.md

================================================
FILE CONTENTS
================================================

================================================
FILE: Jupyter_nnmodel/README.md
================================================
# Content: Jupyter Version of 2nd place code kaggle-porto-seguro
## [Porto Seguro’s Safe Driver Prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) in Kaggle Competition

### Purpose
The purpose of this Jupyter notebook is to make the flow of the code readable and easy to understand. Some code has been changed from the original author's version, but the processing concept of the 2nd place code is exactly the same.
Some of the feature engineering code has been moved into `feature_generater.py`.

### Install

This project requires **Python 2.7 or 3.6** and the following Python libraries installed:

- [NumPy](http://www.numpy.org/)
- [Pandas](http://pandas.pydata.org)
- [Keras](https://keras.io/)
- [scikit-learn](http://scikit-learn.org/stable/)
- [xgboost](https://xgboost.readthedocs.io/)
- [pickle](https://docs.python.org/3/library/pickle.html)
- [itertools](https://docs.python.org/2/library/itertools.html)

You will also need to have software installed to run and execute a [Jupyter Notebook](http://ipython.org/notebook.html).

If you do not have Python installed yet, it is highly recommended that you install the [Anaconda](http://continuum.io/downloads) distribution of Python, which already includes most of the above packages. Either the Python 2.7 or the Python 3.6 installer works; the Python 3 note in step 1 below applies if you choose the latter.

### Code

1. Run `fea_eng0.py` to generate the first set of features as a pickle file, `fea0.pk`. Note: if you are using Python 3, switch the last line of `fea_eng0.py` (the `pickle.dump` call) to the Python 3 version, so that the pickle file can be read in `nn_model.ipynb`.
2. The main code is provided in the `nn_model.ipynb` notebook file. You will also need the included `util.py` and `feature_generater.py` Python files, plus the `train.csv` and `test.csv` dataset files, which you have to download from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) into the `input` folder. While some code has already been implemented to get you started, you will need to implement additional functionality where requested. When `nn_model.ipynb` runs, its default output is written to the `model` folder. If you are interested in `util.py` and `feature_generater.py`, please feel free to explore these Python files.
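The Python 2/3 note in step 1 comes down to the pickle protocol. The following is a minimal sketch and not part of the original scripts: the small nested lists and the temporary file path are placeholders assumed for illustration (the real script pickles NumPy feature matrices to `../input/fea0.pk`). It shows that a file written with `protocol=2`, which is what `fea_eng0.py` uses by default, can be loaded under Python 3 with the same `encoding` argument that the notebook passes.

```python
import os
import pickle
import tempfile

# Placeholder nested lists standing in for the real feature matrices (hypothetical values).
train_features = [[0.1, 0.2], [0.3, 0.4]]
test_features = [[0.5, 0.6]]

# Temporary path used only for this sketch; the scripts write to ../input/fea0.pk.
path = os.path.join(tempfile.mkdtemp(), "fea0.pk")

# protocol=2 is the highest protocol Python 2 can read; Python 3 can read it as well.
with open(path, "wb") as f:
    pickle.dump([train_features, test_features], f, protocol=2)

# The encoding argument exists only in Python 3 and matters for files written by Python 2.
with open(path, "rb") as f:
    train_fea0, test_fea0 = pickle.load(f, encoding="iso-8859-1")

print(len(train_fea0), len(test_fea0))
```

Writing with `protocol=2` keeps the file readable from both interpreter lines, which is why the default in `fea_eng0.py` is the more portable of the two variants.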
### Run

In a terminal or command window, navigate to the project directory `Jupyter_nnmodel/` (that contains this README) and run one of the following commands:

```bash
ipython notebook nn_model.ipynb
```
or
```bash
jupyter notebook nn_model.ipynb
```

This will open the Jupyter Notebook software and project file in your browser.

## Data
You can download the data from [Kaggle](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data).

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). In addition, feature names include the postfix `bin` to indicate binary features and `cat` to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The `target` column signifies whether or not a claim was filed for that policy holder.

================================================
FILE: Jupyter_nnmodel/fea_eng0.py
================================================
"""xgb prediction as features"""
import xgboost as xgb
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

# xgboost hyperparameters
eta = 0.1
max_depth = 6
subsample = 0.9
colsample_bytree = 0.85
min_child_weight = 55
num_boost_round = 500

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

params = {"objective": "reg:linear",
          "booster": "gbtree",
          "eta": eta,
          "max_depth": int(max_depth),
          "subsample": subsample,
          "colsample_bytree": colsample_bytree,
          "min_child_weight": min_child_weight,
          "silent": 1
          }

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
data = pd.concat([train, test])
# Note: without drop=True the old index is kept as an 'index' column,
# whose name matches the 'ind' group below.
data.reset_index(inplace=True)
train_rows = train.shape[0]

feature_results = []

# For each feature group, predict every column of the group from the
# columns outside the group, and keep the out-of-fold predictions.
for target_g in ['car', 'ind', 'reg']:
    features = [x for x in list(data) if target_g not in x]
    target_list = [x for x in list(data) if target_g in x]
    train_fea = np.array(data[features])
    for target in target_list:
        print(target)
        train_label = data[target]
        kfold = KFold(n_splits=5, random_state=218, shuffle=True)
        kf = kfold.split(data)
        cv_train = np.zeros(shape=(data.shape[0], 1))
        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50,
                            early_stopping_rounds=10)
            cv_train[validate, 0] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
        feature_results.append(cv_train)

feature_results = np.hstack(feature_results)
train_features = feature_results[:train_rows, :]
test_features = feature_results[train_rows:, :]

import pickle

# for python 2
with open("../input/fea0.pk", 'wb') as f:
    pickle.dump([train_features, test_features], f, protocol=2)
# for python 3
# with open("../input/fea0.pk", 'wb') as f:
#     pickle.dump([train_features, test_features], f, protocol=3)

================================================
FILE: Jupyter_nnmodel/feature_generater.py
================================================
from util import proj_num_on_cat, Gini, interaction_features
from itertools import combinations
import numpy as np
import pandas as pd


def Multiply_Divide(train, test, features):
    """
    Add interaction features for every pair of columns in `features`.

    combinations:
    combinations(['A', 'B','C'],2) returns AB AC BC
    combinations(range(4), 3) --> 012 013 023 123
    """
    feature_names = []
    for e, (x, y) in enumerate(combinations(features, 2)):
        train, test, feature_name = interaction_features(train, test, x, y, e)
        for name in feature_name:
            feature_names.append(name)
    return train, test, feature_names


def Series_string(train, test, category_list):
    '''
    Join all columns of a category into one string column, e.g. new_ind:

    id      new_ind_count  new_ind
    595207  117            3_1_10_0_0_0_0_0_1_0_0_0_0_0_13_1_0_0
    595208  153            5_1_3_0_0_0_0_0_1_0_0_0_0_0_6_1_0_0

    Returns train and test with the new columns new_<category>.
    '''
    for category in category_list:
        feature_names = list(train.columns)
        features = [c for c in feature_names if category in c]
        name = 'new_' + category
        count = 0
        for c in features:
            if count == 0:
                train[name] = train[c].astype(str)
                count += 1
            else:
                train[name] += '_' + train[c].astype(str)
        count = 0
        for c in features:
            if count == 0:
                test[name] = test[c].astype(str)
                count += 1
            else:
                test[name] += '_' + test[c].astype(str)
    return train, test


def Features_Counts(train, test, features):
    # For each column, map every value to its number of occurrences in train + test.
    feature_names = []
    for c in features:
        d = pd.concat([train[c], test[c]]).value_counts().to_dict()
        train['%s_count' % c] = train[c].apply(lambda x: d.get(x, 0))
        test['%s_count' % c] = test[c].apply(lambda x: d.get(x, 0))
        feature_names.append('%s_count' % c)
    return train, test, feature_names


def Statistic_features(train, test, target_features, group_features):
    # Aggregate every target column over every other group column and stack the results.
    train_list_ = []
    test_list_ = []
    for t in target_features:
        for g in group_features:
            if t != g:
                s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
                train_list_.append(s_train)
                test_list_.append(s_test)
    return np.hstack(train_list_), np.hstack(test_list_)


def features_type(train):
    data = []
    for f in train.columns:
        # Defining the level
        if 'bin' in f or f == 'target':
            level = 'binary'
        elif 'cat' in f or f == 'id':
            level = 'nominal'
        elif train[f].dtype == float:
            level = 'interval'
        elif train[f].dtype == int:
            level = 'ordinal'

        # Initialize keep to True for all variables except for id
        keep = True
        if f == 'id':
            keep = False

        # Defining the data type
        dtype = train[f].dtype

        # Creating a Dict that contains all the metadata for the variable
        f_dict = {
            'varname': f,
            'level': level,
            'keep': keep,
            'dtype': dtype
        }
        data.append(f_dict)

    meta = pd.DataFrame(data, columns=['varname', 'level', 'keep', 'dtype'])
    meta.set_index('varname', inplace=True)

    interval = meta[(meta.level == 'interval') & (meta.keep)].index
    ordinal = meta[(meta.level == 'ordinal') & (meta.keep)].index
    binary = meta[(meta.level == 'binary') & (meta.keep)].index
    nominal = meta[(meta.level == 'nominal') & (meta.keep)].index
    return interval, ordinal, binary, nominal

================================================
FILE: Jupyter_nnmodel/nn_model .ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/stevenhu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      " from ._conv import register_converters as _register_converters\n",
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "#homemade script\n",
    "from util import Gini\n",
    "from feature_generater import Multiply_Divide, Series_string, Features_Counts, Statistic_features\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pickle\n",
    "from scipy import sparse\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "#for NN model\n",
    "from keras.layers import Dense, Dropout, Embedding, Flatten, Input, Concatenate, merge\n",
    "from keras.layers.normalization import BatchNormalization\n",
    "from keras.layers.advanced_activations import PReLU\n",
    "from keras.models import Model\n",
    "from time import time\n",
    "import datetime\n",
    "from sklearn.model_selection import StratifiedKFold\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. 
Load Data #\n",
    "\n",
    "- Create the train and test datasets\n",
    "- Create the train target label\n",
    "- Create the feature lists: cat, num, bin, inter\n",
    "- Create a feature column in train and test: the count of missing values per row\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_only = True\n",
    "save_cv = True\n",
    "\n",
    "#read data\n",
    "train = pd.read_csv(\"../input/train.csv\")\n",
    "train_label = train['target']\n",
    "train_id = train['id']\n",
    "del train['target'], train['id']\n",
    "\n",
    "test = pd.read_csv(\"../input/test.csv\")\n",
    "test_id = test['id']\n",
    "del test['id']\n",
    "\n",
    "\n",
    "\n",
    "#count the missing values (-1) in each row and record the count in column 'missing'\n",
    "train['missing'] = (train==-1).sum(axis=1).astype(float)\n",
    "test['missing'] = (test==-1).sum(axis=1).astype(float)\n",
    "\n",
    "#get all feature names\n",
    "feature_names = list(train)\n",
    "\n",
    "# extract the feature lists: cat, bin, num, inter\n",
    "cat_fea = [x for x in list(train) if 'cat' in x]\n",
    "bin_fea = [x for x in list(train) if 'bin' in x]\n",
    "num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]\n",
    "inter_fea = [x for x in list(train) if 'inter' in x]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. 
Feature Engineering #\n",
    "\n",
    "- Create the Multiply and Divide features\n",
    "- Create the count features (count encoding)\n",
    "- Load the features generated by `fea_eng0.py`\n",
    "- Create the statistic features\n",
    "- Combine all features and get ready for training\n",
    "- Create Cat_feature for NN embedding training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 Multiply and Divide features ##\n",
    "- multiply and divide each pair of features in the list and add the results as new columns to the train and test data sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Add features of Multiply and Divide\n",
    "features= ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\n",
    "train, test, MD_features = Multiply_Divide(train, test, features)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "
" ], "text/plain": [ " inter_0* inter_0/ inter_1* inter_1/ inter_2* inter_2/ inter_3* \\\n", "0 4.418395 0.176736 0.634544 1.230630 9.720468 0.080334 0.618575 \n", "1 4.331716 0.088402 0.474062 0.807773 1.856450 0.206272 0.495053 \n", "2 5.774271 0.071287 -0.641586 -0.641586 7.699029 0.053465 0.000000 \n", "3 1.085898 0.271474 0.315425 0.934592 4.343590 0.067869 0.488654 \n", "4 0.000000 inf 0.475728 0.673001 5.092484 0.062870 0.396082 \n", "\n", " inter_3/ inter_4* inter_4/ ... inter_10* inter_10/ inter_11* \\\n", "0 1.262398 1.767358 0.441839 ... 0.502649 1.025815 1.436141 \n", "1 0.773521 0.618817 0.618817 ... 0.612862 0.957597 0.766078 \n", "2 inf 3.207929 0.128317 ... -0.000000 -inf -5.000000 \n", "3 0.603276 0.000000 inf ... 0.522853 0.645497 0.000000 \n", "4 0.808331 0.000000 inf ... 0.588531 1.201084 0.000000 \n", "\n", " inter_11/ inter_12* inter_12/ inter_13* inter_13/ inter_14* inter_14/ \n", "0 0.359035 7.7 15.714286 22 5.500000 1.4 0.350000 \n", "1 0.766078 2.4 3.750000 3 3.000000 0.8 0.800000 \n", "2 -0.200000 0.0 inf 60 2.400000 0.0 0.000000 \n", "3 inf 7.2 8.888889 0 inf 0.0 inf \n", "4 inf 6.3 12.857143 0 inf 0.0 inf \n", "\n", "[5 rows x 30 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(train[MD_features].head(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.2 Feature of Counts ##\n", "1. Generate new_ind, new_reg, new_car\n", "2. Count the number of distinct values of\n", " cat features, new_ind, new_reg and new_car" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "\n", "'''\n", "create 1_0_1_1..... 
data as new_xxx\n", "\n", "new_ind: collect all data from all relative \"ind\" columns, then generate series number\n", "\n", "new_reg, new_car for train and test data \n", "For RNN processing, generating a sequence number\n", "'''\n", "\n", "\n", "category_list = ['ind', 'reg', 'car']\n", "#add 'new_ind','new_reg','new_car' in train and test dataset\n", "train, test = Series_string(train,test,category_list )\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " new_ind new_reg \\\n", "0 2_2_5_1_0_0_1_0_0_0_0_0_0_0_11_0_1_0 0.7_0.2_0.7180703308 \n", "1 1_1_7_0_0_0_0_1_0_0_0_0_0_0_3_0_0_1 0.8_0.4_0.7660776723 \n", "2 5_4_9_1_0_0_0_1_0_0_0_0_0_0_12_1_0_0 0.0_0.0_-1.0 \n", "3 0_1_2_0_0_1_0_0_0_0_0_0_0_0_8_1_0_0 0.9_0.2_0.5809475019 \n", "4 0_2_0_1_0_1_0_0_0_0_0_0_0_0_9_1_0_0 0.7_0.6_0.840758586 \n", "\n", " new_car \n", "0 10_1_-1_0_1_4_1_0_0_1_12_2_0.4_0.8836789178_0.... \n", "1 11_1_-1_0_-1_11_1_1_2_1_19_3_0.316227766_0.618... \n", "2 7_1_-1_0_-1_14_1_1_2_1_60_1_0.316227766_0.6415... \n", "3 7_1_0_0_1_11_1_1_3_1_104_1_0.3741657387_0.5429... \n", "4 11_1_-1_0_-1_14_1_1_2_1_82_3_0.3160696126_0.56... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(train[['new_ind','new_reg','new_car']].head(5))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "'''\n", "count_features\n", "\n", "preparing for train[cat_count_features] \n", "cat_fea = \n", "['ps_ind_02_cat','ps_ind_04_cat','ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat',\n", " 'ps_car_03_cat','ps_car_04_cat','ps_car_05_cat','ps_car_06_cat','ps_car_07_cat',\n", " 'ps_car_08_cat','ps_car_09_cat','ps_car_10_cat', 'ps_car_11_cat']\n", "\n", "Example: \n", "ps_ind_02_cat_count\n", "dictionay of ps_ind_02_cat \n", "([(1, 1079327), (2, 309747), (3, 70172), (4, 28259), (-1, 523)])\n", "\n", "row count origial value\n", "595202 1079327 1 \n", "595203 309747 2\n", "595204 309747 2\n", "595205 70172 3\n", "595206 1079327 1\n", "\n", "''' \n", "\n", "cat_fea = [ name for name in list(train) if 'cat' in name and 'count' not in name]\n", "features= cat_fea + ['new_ind','new_reg','new_car']\n", "\n", "train, test, cat_count_features= Features_Counts(train, test, features)\n", "\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ps_ind_02_cat_count ps_ind_04_cat_count ps_ind_05_cat_count \\\n", "0 309747 620936 1319412 \n", "1 1079327 866864 1319412 \n", "2 28259 620936 1319412 \n", "3 1079327 866864 1319412 \n", "4 309747 620936 1319412 \n", "\n", " ps_car_01_cat_count ps_car_02_cat_count ps_car_03_cat_count \\\n", "0 124587 1234979 1028142 \n", "1 518725 1234979 1028142 \n", "2 449617 1234979 1028142 \n", "3 449617 1234979 183044 \n", "4 518725 1234979 1028142 \n", "\n", " ps_car_04_cat_count ps_car_05_cat_count ps_car_06_cat_count \\\n", "0 1241334 431560 77845 \n", "1 1241334 666910 329890 \n", "2 1241334 666910 147714 \n", "3 1241334 431560 329890 \n", "4 1241334 666910 147714 \n", "\n", " ps_car_07_cat_count ps_car_08_cat_count ps_car_09_cat_count \\\n", "0 1383070 249663 486510 \n", "1 1383070 1238365 883326 \n", "2 1383070 1238365 883326 \n", "3 1383070 1238365 36798 \n", "4 1383070 1238365 883326 \n", "\n", " ps_car_10_cat_count ps_car_11_cat_count new_ind_count new_reg_count \\\n", "0 1475460 18326 6 24 \n", "1 1475460 12535 36 38 \n", "2 1475460 19943 24 13477 \n", "3 1475460 212989 2784 222 \n", "4 1475460 26161 258 34 \n", "\n", " new_car_count \n", "0 1 \n", "1 11 \n", "2 40 \n", "3 1 \n", "4 13 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# display(train[['new_ind','new_reg','new_car']].head(5))\n", "display(train[cat_count_features].head(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3 Get the feature from feature training ## " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "train_fea0, test_fea0 = pickle.load(open(\"../input/fea0.pk\",'rb'), encoding='iso-8859-1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.4 Statistic features ##\n", "\n", "- find the feature of median, mean and standard deviation" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#feature aggregation\n", "target_features 
= ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']\n",
    "group_features = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']\n",
    "\n",
    "#returns NumPy arrays, because the statistic features of all pairs are merged with np.hstack\n",
    "train_statis, test_statis = Statistic_features(train, test, target_features, group_features)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.57043000e+05, 8.32786207e-01, 2.41530046e-01, ...,\n",
       " 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
       " [ 1.30452000e+05, 8.26528390e-01, 2.35133348e-01, ...,\n",
       " 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
       " [ 6.35510000e+04, 8.13168936e-01, 2.35946815e-01, ...,\n",
       " 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
       " ..., \n",
       " [ 3.58630000e+04, 8.00633360e-01, 2.34463222e-01, ...,\n",
       " 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
       " [ 2.04836000e+05, 8.24270444e-01, 2.28649975e-01, ...,\n",
       " 1.00000000e+00, 7.00000000e+00, 0.00000000e+00],\n",
       " [ 9.85280000e+04, 8.24453229e-01, 2.37806003e-01, ...,\n",
       " 1.00000000e+00, 7.00000000e+00, 0.00000000e+00]])"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(train_statis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 Combine all features and get ready for training ##\n",
    "\n",
    "1. merge the features to feed into the NN model into train_list & test_list\n",
    "2. fit a scaler on the matrix stacked from train_list & test_list\n",
    "3. 
convert train_list & test_list into the scaled matrices X and X_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Build a train list from train_num, cat_count_features, the statistic features and the features inferred by XGBoost.\n",
    "\n",
    "train_num: training columns whose names contain neither 'cat' nor 'calc'\n",
    "cat_count_features: value counts of cat_fea + ['new_ind','new_reg','new_car']\n",
    "train_fea0: features predicted by fea_eng0.py\n",
    "'''\n",
    "\n",
    "\n",
    "#training columns whose names contain neither 'cat' nor 'calc'\n",
    "train_num = train[[x for x in list(train) if x in num_features]]\n",
    "test_num = test[[x for x in list(train) if x in num_features]]\n",
    "\n",
    "train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_statis, train_fea0 ]\n",
    "test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_statis,test_fea0] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "'''\n",
    "X is stacked from the following feature blocks\n",
    "1. train_num(595212,54): training columns whose names contain neither 'cat' nor 'calc'\n",
    "2. cat_count_features(595212,17): value counts of cat_fea + ['new_ind','new_reg','new_car']\n",
    "3. feature statis(595212,6) * 36\n",
    "4. train_fea0(595212, 38): features predicted by fea_eng0.py\n",
    "\n",
    "all_data (595212, 235)\n",
    "'''\n",
    "\n",
    "\n",
    "X = sparse.hstack(train_list).tocsr()\n",
    "X_test = sparse.hstack(test_list).tocsr()\n",
    "\n",
    "all_data = np.vstack([X.toarray(), X_test.toarray()])\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(all_data)\n",
    "X = scaler.transform(X.toarray())\n",
    "X_test = scaler.transform(X_test.toarray())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.6 Create Cat_feature for NN embedding training ##\n",
    " The original code does not explain why this is done; this section only describes what it does.\n",
    " \n",
    " - The NN model below takes its input as a list of the form **[[cat_feature], X]**,\n",
    " i.e. **[['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'], X]**: one array per categorical feature, followed by the numeric matrix\n",
    " \n",
    " **_This cell builds that list for the test data. If it is unclear, that is fine; just move on to the next steps._**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/pandas/core/indexing.py:630: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      " self.obj[item_labels[indexer[info_axis]]] = value\n"
     ]
    }
   ],
   "source": [
    "#prepare the categorical features for training\n",
    "train_cat = train[cat_fea]\n",
    "test_cat = test[cat_fea]\n",
    "\n",
    "# store the largest label of each feature; the embedding layer uses it as its input dimension\n",
    "max_cat_values = []\n",
    "\n",
    "for c in cat_fea:\n",
    "    \n",
    "    #normalize the labels to 0..n-1\n",
    "    #LabelEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html\n",
    "    \n",
    "    le = LabelEncoder()\n",
    "    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])\n",
    "    train_cat.loc[:,c] = le.transform(train_cat[c])\n",
    "    test_cat.loc[:,c] = le.transform(test_cat[c])\n",
    "    max_cat_values.append(np.max(x))\n",
    "\n",
    "# convert pd to np.array after encoding, so that the arrays hold the encoded labels\n",
    "X_cat = train_cat.values\n",
    "tem = test_cat.values\n",
    "\n",
    "# Build the final testing data\n",
    "X_TEST_CAT = []\n",
    "for i in range(tem.shape[1]):\n",
    "    X_TEST_CAT.append(tem[:, i].reshape(-1, 1))\n",
    "X_TEST_CAT.append(X_test)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
"cat_fea: ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat']\n", "\n", "max_cat_values: [4, 2, 7, 12, 2, 2, 9, 2, 17, 2, 1, 5, 2, 103]\n" ] } ], "source": [ "print('cat_fea:', cat_fea)\n", "print('\\nmax_cat_values: ',max_cat_values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Training NN Model with Keras # " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Build the model\n", "2. training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1 Build the model ##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### model structure: ###\n", "" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "def nn_model():\n", " inputs = []\n", " flatten_layers = []\n", " for e, c in enumerate(cat_fea):\n", " input_c = Input(shape=(1, ), dtype='int32')\n", " num_c = max_cat_values[e]\n", " \n", " # need to add 1, https://keras.io/layers/embeddings/\n", " # **input_dim: int > 0. Size of the vocabulary, i.e. 
maximum integer index + 1.**\n", " embed_c = Embedding(num_c+1,6,input_length=1)(input_c)\n", " embed_c = Dropout(0.25)(embed_c)\n", " flatten_c = Flatten()(embed_c)\n", " inputs.append(input_c)\n", " flatten_layers.append(flatten_c)\n", " \n", " \n", " input_num = Input(shape=(X.shape[1],), dtype='float32')\n", " inputs.append(input_num)\n", " \n", " #merge X and embedding layer\n", " flatten_layers.append(input_num)\n", " flatten = merge(flatten_layers, mode='concat')\n", "\n", " fc1 = Dense(512, kernel_initializer='he_normal')(flatten)\n", " fc1 = PReLU()(fc1)\n", " fc1 = BatchNormalization()(fc1)\n", " fc1 = Dropout(0.75)(fc1)\n", "\n", " fc1 = Dense(64, kernel_initializer='he_normal')(fc1)\n", " fc1 = PReLU()(fc1)\n", " fc1 = BatchNormalization()(fc1)\n", " fc1 = Dropout(0.5)(fc1)\n", "\n", " outputs = Dense(1, kernel_initializer='he_normal', activation='sigmoid')(fc1)\n", "\n", " model = Model(inputs = inputs, outputs = outputs)\n", " model.compile(loss='binary_crossentropy', optimizer='adam')\n", " return (model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Start to Train ##" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/ipykernel_launcher.py:22: UserWarning: The `merge` function is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\n", "/Users/stevenhu/anaconda3/envs/run/lib/python3.6/site-packages/keras/legacy/layers.py:465: UserWarning: The `Merge` layer is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. 
`add`, `concatenate`, etc.\n", " name=name)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Train on 297606 samples, validate on 297606 samples\n", "Epoch 1/1\n", " - 27s - loss: 0.3061 - val_loss: 0.1639\n", "local fold Gini: 0.209322322663\n", "Train on 297606 samples, validate on 297606 samples\n", "Epoch 1/1\n", " - 28s - loss: 0.3087 - val_loss: 0.1645\n", "local fold Gini: 0.201585256464\n", "seed 0: Gini 0.20545379936910999\n", "Total training time: 0:01:55.265348\n" ] } ], "source": [ "\"\"\"\n", "#validation fold\n", "NFOLDS = 5\n", "kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n", "\n", "I renamed \"test\" to \"valid\" because it is clearer to understand\n", "\"\"\"\n", "\n", "cv_train = np.zeros(len(train_label))\n", "cv_pred = np.zeros(len(test_id))\n", "\n", "#validation fold\n", "NFOLDS = 5\n", "kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)\n", "\n", "#average over several random seeds to make the result stable.\n", "num_seeds = 5\n", "begintime = time()\n", "if cv_only:\n", " for s in range(num_seeds):\n", " np.random.seed(s)\n", " for (train_index, valid_index) in kfold.split(X, train_label):\n", " \n", " #split the numerical features and labels into training and validation folds\n", " x_train = X[train_index]\n", " y_train = train_label[train_index]\n", " x_valid= X[valid_index]\n", " y_valid = train_label[valid_index]\n", " \n", " # split the categorical features (X_cat) the same way\n", " x_train_cat = X_cat[train_index]\n", " x_valid_cat = X_cat[valid_index]\n", "\n", " #Package data for training, the package (list) is [[cat_features], x_train] \n", " # or [ ['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat',.....,'ps_car_11_cat'] ,x_train]\n", " \n", " x_train_cat_list, x_valid_cat_list = [], []\n", " for i in range(x_train_cat.shape[1]):\n", " x_train_cat_list.append(x_train_cat[:, i].reshape(-1, 1))\n", " x_valid_cat_list.append(x_valid_cat[:, i].reshape(-1, 1))\n", "\n", " x_train_cat_list.append(x_train)\n", "
x_valid_cat_list.append(x_valid)\n", " \n", " #build a fresh model for this fold\n", " model = nn_model()\n", " \n", " def get_rank(x):\n", " return pd.Series(x).rank(pct=True).values\n", " #fit model. Note: tune epochs to improve prediction accuracy\n", " model.fit(x_train_cat_list, y_train, epochs=10, batch_size=512, verbose=2, validation_data=[x_valid_cat_list, y_valid])\n", " \n", " #record rank-transformed predictions on the validation fold\n", " cv_train[valid_index] += get_rank(model.predict(x=x_valid_cat_list, batch_size=512, verbose=0)[:, 0])\n", " print('local fold Gini: ',Gini(train_label[valid_index], cv_train[valid_index]))\n", " \n", " #record rank-transformed predictions on the test data\n", " cv_pred += get_rank(model.predict(x=X_TEST_CAT, batch_size=512, verbose=0)[:, 0])\n", " \n", " \n", " \n", " print(\"seed {0}: Gini {1}\".format(s,Gini(train_label, cv_train / (1. * (s + 1)))))\n", " print(\"Total training time: \",str(datetime.timedelta(seconds=time() - begintime)))\n", " if save_cv:\n", " \n", " #divide by (NFOLDS * num_seeds) to average the rank predictions \n", " pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False)\n", " pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1.
/ num_seeds)}).to_csv('../model/keras5_cv.csv', index=False)\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: Jupyter_nnmodel/util.py ================================================ import numpy as np import pandas as pd def Gini(y_true, y_pred): # check and get number of samples assert y_true.shape == y_pred.shape n_samples = y_true.shape[0] # sort rows on prediction column # (from largest to smallest) arr = np.array([y_true, y_pred]).transpose() true_order = arr[arr[:, 0].argsort()][::-1, 0] pred_order = arr[arr[:, 1].argsort()][::-1, 0] # get Lorenz curves L_true = np.cumsum(true_order) * 1. / np.sum(true_order) L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order) L_ones = np.linspace(1 / n_samples, 1, n_samples) # get Gini coefficients (area between curves) G_true = np.sum(L_ones - L_true) G_pred = np.sum(L_ones - L_pred) # normalize to true Gini coefficient return G_pred * 1. 
/ G_true def cat_count(train_df, test_df, cat_list): train_df['row_id'] = range(train_df.shape[0]) test_df['row_id'] = range(test_df.shape[0]) train_df['train'] = 1 test_df['train'] = 0 all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list]) for e, cat in enumerate(cat_list): grouped = all_df[[cat]].groupby(cat) the_size = pd.DataFrame(grouped.size()).reset_index() the_size.columns = [cat, '{}_size'.format(cat)] all_df = pd.merge(all_df, the_size, how='left') selected_train = all_df[all_df['train'] == 1] selected_test = all_df[all_df['train'] == 0] selected_train.sort_values('row_id', inplace=True) selected_test.sort_values('row_id', inplace=True) selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True) selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True) selected_train, selected_test = np.array(selected_train), np.array(selected_test) print(selected_train.shape, selected_test.shape) return selected_train, selected_test def proj_num_on_cat(train_df, test_df, target_column, group_column): """ :param train_df: train data frame :param test_df: test data frame :param target_column: name of numerical feature :param group_column: name of categorical feature """ train_df['row_id'] = range(train_df.shape[0]) # 595211 create index for each row test_df['row_id'] = range(test_df.shape[0]) train_df['train'] = 1 test_df['train'] = 0 all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train', target_column, group_column]]).copy() #group rows by the categorical column grouped = all_df[[target_column, group_column]].groupby(group_column) #count the number of distinct values from the list [1,1, 2,3] #[1,2,3] so answer is 3 #count the number of each distinct value [1,1,2,3] #1:2 #2:1 #3:1 #count the number of each distinct value the_size = pd.DataFrame(grouped.size()).reset_index() the_size.columns = [group_column, '%s_size' % target_column] #rename columns #find the mean, std,
median, max, min of each distinct value the_mean = pd.DataFrame(grouped.mean()).reset_index() the_mean.columns = [group_column, '%s_mean' % target_column] #rename columns the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0) the_std.columns = [group_column, '%s_std' % target_column] the_median = pd.DataFrame(grouped.median()).reset_index() the_median.columns = [group_column, '%s_median' % target_column] the_max = pd.DataFrame(grouped.max()).reset_index() the_max.columns = [group_column, '%s_max' % target_column] the_min = pd.DataFrame(grouped.min()).reset_index() the_min.columns = [group_column, '%s_min' % target_column] #merge them the_stats=pd.concat([the_size,the_mean.iloc[:,1],the_std.iloc[:,1] ,the_median.iloc[:,1] ,the_max.iloc[:,1],the_min.iloc[:,1]] ,axis=1, join_axes=[the_size.index]) #merge the statistics back into the original data all_df = pd.merge(all_df, the_stats, how='left') #split into train and test selected_train = all_df[all_df['train'] == 1].copy() selected_test = all_df[all_df['train'] == 0].copy() selected_train.sort_values('row_id', inplace=True) selected_test.sort_values('row_id', inplace=True) #remove target_column, group_column, 'row_id', 'train' columns selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True) selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True) selected_train, selected_test = np.array(selected_train), np.array(selected_test) return selected_train, selected_test def interaction_features(train, test, fea1, fea2, prefix): train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2] train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2] test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2] test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2] feature_name = ['inter_{}*'.format(prefix), 'inter_{}/'.format(prefix), ] return train, test, feature_name ================================================ FILE: code/fea_eng0.py
================================================ """xgb prediction as features""" import xgboost as xgb from sklearn.model_selection import KFold import numpy as np import pandas as pd eta = 0.1 max_depth = 6 subsample = 0.9 colsample_bytree = 0.85 min_child_weight = 55 num_boost_round = 500 train = pd.read_csv("../input/train.csv") train_label = train['target'] train_id = train['id'] del train['target'], train['id'] test = pd.read_csv("../input/test.csv") test_id = test['id'] del test['id'] params = {"objective": "reg:linear", "booster": "gbtree", "eta": eta, "max_depth": int(max_depth), "subsample": subsample, "colsample_bytree": colsample_bytree, "min_child_weight": min_child_weight, "silent": 1 } data = train.append(test) data.reset_index(inplace=True) train_rows = train.shape[0] feature_results = [] for target_g in ['car', 'ind', 'reg']: features = [x for x in list(data) if target_g not in x] target_list = [x for x in list(data) if target_g in x] train_fea = np.array(data[features]) for target in target_list: print(target) train_label = data[target] kfold = KFold(n_splits=5, random_state=218, shuffle=True) kf = kfold.split(data) cv_train = np.zeros(shape=(data.shape[0], 1)) for i, (train_fold, validate) in enumerate(kf): X_train, X_validate, label_train, label_validate = \ train_fea[train_fold, :], train_fea[validate, :], train_label[train_fold], train_label[validate] dtrain = xgb.DMatrix(X_train, label_train) dvalid = xgb.DMatrix(X_validate, label_validate) watchlist = [(dtrain, 'train'), (dvalid, 'valid')] bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, verbose_eval=50, early_stopping_rounds=10) cv_train[validate, 0] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit) feature_results.append(cv_train) feature_results = np.hstack(feature_results) train_features = feature_results[:train_rows, :] test_features = feature_results[train_rows:, :] import pickle pickle.dump([train_features, test_features], 
open("../input/fea0.pk", 'wb')) ================================================ FILE: code/gbm_model291.py ================================================ import lightgbm as lgbm from scipy import sparse as ssp from sklearn.model_selection import StratifiedKFold import numpy as np import pandas as pd from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import OneHotEncoder def Gini(y_true, y_pred): # check and get number of samples assert y_true.shape == y_pred.shape n_samples = y_true.shape[0] # sort rows on prediction column # (from largest to smallest) arr = np.array([y_true, y_pred]).transpose() true_order = arr[arr[:, 0].argsort()][::-1, 0] pred_order = arr[arr[:, 1].argsort()][::-1, 0] # get Lorenz curves L_true = np.cumsum(true_order) * 1. / np.sum(true_order) L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order) L_ones = np.linspace(1 / n_samples, 1, n_samples) # get Gini coefficients (area between curves) G_true = np.sum(L_ones - L_true) G_pred = np.sum(L_ones - L_pred) # normalize to true Gini coefficient return G_pred * 1. 
/ G_true cv_only = True save_cv = True full_train = False def evalerror(preds, dtrain): labels = dtrain.get_label() return 'gini', Gini(labels, preds), True path = "../input/" train = pd.read_csv(path+'train.csv') train_label = train['target'] train_id = train['id'] test = pd.read_csv(path+'test.csv') test_id = test['id'] NFOLDS = 5 kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218) y = train['target'].values drop_feature = [ 'id', 'target' ] X = train.drop(drop_feature,axis=1) feature_names = X.columns.tolist() cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)] num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)] train['missing'] = (train==-1).sum(axis=1).astype(float) test['missing'] = (test==-1).sum(axis=1).astype(float) num_features.append('missing') for c in cat_features: le = LabelEncoder() le.fit(train[c]) train[c] = le.transform(train[c]) test[c] = le.transform(test[c]) enc = OneHotEncoder() enc.fit(train[cat_features]) X_cat = enc.transform(train[cat_features]) X_t_cat = enc.transform(test[cat_features]) ind_features = [c for c in feature_names if 'ind' in c] count=0 for c in ind_features: if count==0: train['new_ind'] = train[c].astype(str)+'_' test['new_ind'] = test[c].astype(str)+'_' count+=1 else: train['new_ind'] += train[c].astype(str)+'_' test['new_ind'] += test[c].astype(str)+'_' cat_count_features = [] for c in cat_features+['new_ind']: d = pd.concat([train[c],test[c]]).value_counts().to_dict() train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0)) test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0)) cat_count_features.append('%s_count'%c) train_list = [train[num_features+cat_count_features].values,X_cat,] test_list = [test[num_features+cat_count_features].values,X_t_cat,] X = ssp.hstack(train_list).tocsr() X_test = ssp.hstack(test_list).tocsr() learning_rate = 0.1 num_leaves = 15 min_data_in_leaf = 2000 feature_fraction = 0.6 num_boost_round = 10000 params = 
{"objective": "binary", "boosting_type": "gbdt", "learning_rate": learning_rate, "num_leaves": num_leaves, "max_bin": 256, "feature_fraction": feature_fraction, "verbosity": 0, "drop_rate": 0.1, "is_unbalance": False, "max_drop": 50, "min_child_samples": 10, "min_child_weight": 150, "min_split_gain": 0, "subsample": 0.9 } x_score = [] final_cv_train = np.zeros(len(train_label)) final_cv_pred = np.zeros(len(test_id)) for s in xrange(16): cv_train = np.zeros(len(train_label)) cv_pred = np.zeros(len(test_id)) params['seed'] = s if cv_only: kf = kfold.split(X, train_label) best_trees = [] fold_scores = [] for i, (train_fold, validate) in enumerate(kf): X_train, X_validate, label_train, label_validate = \ X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate] dtrain = lgbm.Dataset(X_train, label_train) dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain) bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100, early_stopping_rounds=100) best_trees.append(bst.best_iteration) cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration) cv_train[validate] += bst.predict(X_validate) score = Gini(label_validate, cv_train[validate]) print score fold_scores.append(score) cv_pred /= NFOLDS final_cv_train += cv_train final_cv_pred += cv_pred print("cv score:") print Gini(train_label, cv_train) print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1 print(fold_scores) print(best_trees, np.mean(best_trees)) x_score.append(Gini(train_label, cv_train)) print(x_score) pd.DataFrame({'id': test_id, 'target': final_cv_pred / 16.}).to_csv('../model/lgbm3_pred_avg.csv', index=False) pd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm3_cv_avg.csv', index=False) ================================================ FILE: code/nn_model290.py ================================================ from keras.layers import Dense, Dropout, Embedding, Flatten, Input, 
merge from keras.layers.normalization import BatchNormalization from keras.layers.advanced_activations import PReLU from time import time import datetime from keras.models import Model from sklearn.model_selection import StratifiedKFold import numpy as np import pandas as pd from util import Gini, interaction_features from itertools import combinations from util import proj_num_on_cat from scipy import sparse from sklearn.preprocessing import StandardScaler import pickle from sklearn.preprocessing import LabelEncoder cv_only = True save_cv = True NFOLDS = 5 kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218) train = pd.read_csv("../input/train.csv") train_label = train['target'] train_id = train['id'] del train['target'], train['id'] test = pd.read_csv("../input/test.csv") test_id = test['id'] del test['id'] cat_fea = [x for x in list(train) if 'cat' in x] bin_fea = [x for x in list(train) if 'bin' in x] train['missing'] = (train==-1).sum(axis=1).astype(float) test['missing'] = (test==-1).sum(axis=1).astype(float) # include interactions for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)): train, test = interaction_features(train, test, x, y, e) num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)] num_features.append('missing') inter_fea = [x for x in list(train) if 'inter' in x] feature_names = list(train) ind_features = [c for c in feature_names if 'ind' in c] count = 0 for c in ind_features: if count == 0: train['new_ind'] = train[c].astype(str) count += 1 else: train['new_ind'] += '_' + train[c].astype(str) ind_features = [c for c in feature_names if 'ind' in c] count = 0 for c in ind_features: if count == 0: test['new_ind'] = test[c].astype(str) count += 1 else: test['new_ind'] += '_' + test[c].astype(str) reg_features = [c for c in feature_names if 'reg' in c] count = 0 for c in reg_features: if count == 0: train['new_reg'] = 
train[c].astype(str) count += 1 else: train['new_reg'] += '_' + train[c].astype(str) reg_features = [c for c in feature_names if 'reg' in c] count = 0 for c in reg_features: if count == 0: test['new_reg'] = test[c].astype(str) count += 1 else: test['new_reg'] += '_' + test[c].astype(str) car_features = [c for c in feature_names if 'car' in c] count = 0 for c in car_features: if count == 0: train['new_car'] = train[c].astype(str) count += 1 else: train['new_car'] += '_' + train[c].astype(str) car_features = [c for c in feature_names if 'car' in c] count = 0 for c in car_features: if count == 0: test['new_car'] = test[c].astype(str) count += 1 else: test['new_car'] += '_' + test[c].astype(str) train_cat = train[cat_fea] train_num = train[[x for x in list(train) if x in num_features]] test_cat = test[cat_fea] test_num = test[[x for x in list(train) if x in num_features]] max_cat_values = [] for c in cat_fea: le = LabelEncoder() x = le.fit_transform(pd.concat([train_cat, test_cat])[c]) train_cat[c] = le.transform(train_cat[c]) test_cat[c] = le.transform(test_cat[c]) max_cat_values.append(np.max(x)) # xgboost prediction train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk")) cat_count_features = [] for c in cat_fea + ['new_ind','new_reg','new_car']: d = pd.concat([train[c],test[c]]).value_counts().to_dict() train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0)) test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0)) cat_count_features.append('%s_count'%c) print(train_num.dtypes) train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0] test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0] #feature aggregation for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']: for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']: if t != g: s_train, s_test = proj_num_on_cat(train, test, target_column=t, 
group_column=g) train_list.append(s_train) test_list.append(s_test) X = sparse.hstack(train_list).tocsr() X_test = sparse.hstack(test_list).tocsr() all_data = np.vstack([X.toarray(), X_test.toarray()]) scaler = StandardScaler() scaler.fit(all_data) X = scaler.transform(X.toarray()) X_test = scaler.transform(X_test.toarray()) print(X.shape, X_test.shape) cv_train = np.zeros(len(train_label)) cv_pred = np.zeros(len(test_id)) X_cat = train_cat.as_matrix() X_test_cat = test_cat.as_matrix() x_test_cat = [] for i in xrange(X_test_cat.shape[1]): x_test_cat.append(X_test_cat[:, i].reshape(-1, 1)) x_test_cat.append(X_test) def nn_model(): inputs = [] flatten_layers = [] for e, c in enumerate(cat_fea): input_c = Input(shape=(1, ), dtype='int32') num_c = max_cat_values[e] embed_c = Embedding( num_c + 1, 6, input_length=1 )(input_c) embed_c = Dropout(0.25)(embed_c) flatten_c = Flatten()(embed_c) inputs.append(input_c) flatten_layers.append(flatten_c) input_num = Input(shape=(X.shape[1],), dtype='float32') flatten_layers.append(input_num) inputs.append(input_num) flatten = merge(flatten_layers, mode='concat') fc1 = Dense(512, init='he_normal')(flatten) fc1 = PReLU()(fc1) fc1 = BatchNormalization()(fc1) fc1 = Dropout(0.75)(fc1) fc1 = Dense(64, init='he_normal')(fc1) fc1 = PReLU()(fc1) fc1 = BatchNormalization()(fc1) fc1 = Dropout(0.5)(fc1) outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1) model = Model(input = inputs, output = outputs) model.compile(loss='binary_crossentropy', optimizer='adam') return (model) num_seeds = 5 begintime = time() if cv_only: for s in xrange(num_seeds): np.random.seed(s) for (inTr, inTe) in kfold.split(X, train_label): xtr = X[inTr] ytr = train_label[inTr] xte = X[inTe] yte = train_label[inTe] xtr_cat = X_cat[inTr] xte_cat = X_cat[inTe] # get xtr xte cat xtr_cat_list, xte_cat_list = [], [] for i in xrange(xtr_cat.shape[1]): xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1)) xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))
xtr_cat_list.append(xtr) xte_cat_list.append(xte) model = nn_model() def get_rank(x): return pd.Series(x).rank(pct=True).values model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte]) cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0]) print(Gini(train_label[inTe], cv_train[inTe])) cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0]) print(s) print(Gini(train_label, cv_train / (1. * (s + 1)))) print(str(datetime.timedelta(seconds=time() - begintime))) if save_cv: pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras5_pred.csv', index=False) pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras5_cv.csv', index=False) ================================================ FILE: code/simple_average.py ================================================ ''' simple average of two models to get 2nd place ''' import pandas as pd keras5_test = pd.read_csv("../model/keras5_pred.csv") lgbm3_test = pd.read_csv("../model/lgbm3_pred_avg.csv") def get_rank(x): return pd.Series(x).rank(pct=True).values pd.DataFrame({'id': keras5_test['id'], 'target': get_rank(keras5_test['target']) * 0.5 + get_rank(lgbm3_test['target']) * 0.5}).to_csv( "../model/simple_average.csv", index = False) ================================================ FILE: code/util.py ================================================ import numpy as np import pandas as pd def Gini(y_true, y_pred): # check and get number of samples assert y_true.shape == y_pred.shape n_samples = y_true.shape[0] # sort rows on prediction column # (from largest to smallest) arr = np.array([y_true, y_pred]).transpose() true_order = arr[arr[:, 0].argsort()][::-1, 0] pred_order = arr[arr[:, 1].argsort()][::-1, 0] # get Lorenz curves L_true = np.cumsum(true_order) * 1. / np.sum(true_order) L_pred = np.cumsum(pred_order) * 1.
/ np.sum(pred_order) L_ones = np.linspace(1 / n_samples, 1, n_samples) # get Gini coefficients (area between curves) G_true = np.sum(L_ones - L_true) G_pred = np.sum(L_ones - L_pred) # normalize to true Gini coefficient return G_pred * 1. / G_true def cat_count(train_df, test_df, cat_list): train_df['row_id'] = range(train_df.shape[0]) test_df['row_id'] = range(test_df.shape[0]) train_df['train'] = 1 test_df['train'] = 0 all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list]) for e, cat in enumerate(cat_list): grouped = all_df[[cat]].groupby(cat) the_size = pd.DataFrame(grouped.size()).reset_index() the_size.columns = [cat, '{}_size'.format(cat)] all_df = pd.merge(all_df, the_size, how='left') selected_train = all_df[all_df['train'] == 1] selected_test = all_df[all_df['train'] == 0] selected_train.sort_values('row_id', inplace=True) selected_test.sort_values('row_id', inplace=True) selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True) selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True) selected_train, selected_test = np.array(selected_train), np.array(selected_test) print(selected_train.shape, selected_test.shape) return selected_train, selected_test def proj_num_on_cat(train_df, test_df, target_column, group_column): """ :param train_df: train data frame :param test_df: test data frame :param target_column: name of numerical feature :param group_column: name of categorical feature """ train_df['row_id'] = range(train_df.shape[0]) test_df['row_id'] = range(test_df.shape[0]) train_df['train'] = 1 test_df['train'] = 0 all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train', target_column, group_column]]) grouped = all_df[[target_column, group_column]].groupby(group_column) the_size = pd.DataFrame(grouped.size()).reset_index() the_size.columns = [group_column, '%s_size' % target_column] the_mean = pd.DataFrame(grouped.mean()).reset_index() 
the_mean.columns = [group_column, '%s_mean' % target_column] the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0) the_std.columns = [group_column, '%s_std' % target_column] the_median = pd.DataFrame(grouped.median()).reset_index() the_median.columns = [group_column, '%s_median' % target_column] the_stats = pd.merge(the_size, the_mean) the_stats = pd.merge(the_stats, the_std) the_stats = pd.merge(the_stats, the_median) the_max = pd.DataFrame(grouped.max()).reset_index() the_max.columns = [group_column, '%s_max' % target_column] the_min = pd.DataFrame(grouped.min()).reset_index() the_min.columns = [group_column, '%s_min' % target_column] the_stats = pd.merge(the_stats, the_max) the_stats = pd.merge(the_stats, the_min) all_df = pd.merge(all_df, the_stats, how='left') selected_train = all_df[all_df['train'] == 1] selected_test = all_df[all_df['train'] == 0] selected_train.sort_values('row_id', inplace=True) selected_test.sort_values('row_id', inplace=True) selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True) selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True) selected_train, selected_test = np.array(selected_train), np.array(selected_test) print(selected_train.shape, selected_test.shape) return selected_train, selected_test def interaction_features(train, test, fea1, fea2, prefix): train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2] train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2] test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2] test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2] return train, test ================================================ FILE: code_for_exact_solution/keras3.py ================================================ import os import sys import operator import numpy as np import pandas as pd from scipy import sparse import xgboost as xgb from sklearn import model_selection, preprocessing, ensemble from keras.models import 
Sequential from keras.layers import Dense, Dropout, Activation from keras.layers.normalization import BatchNormalization from keras.layers.advanced_activations import PReLU from sklearn.preprocessing import OneHotEncoder from sklearn.model_selection import StratifiedKFold from time import time import datetime from util import Gini, proj_num_on_cat, interaction_features, cat_count from sklearn.preprocessing import StandardScaler from itertools import combinations import pickle ''' simple xgboost benchmark ''' from sklearn.model_selection import StratifiedKFold, KFold from sklearn.feature_selection import SelectPercentile, f_classif import numpy as np import pandas as pd from util import Gini, proj_num_on_cat, interaction_features, cat_count from sklearn.preprocessing import LabelEncoder from itertools import combinations from util import proj_num_on_cat from sklearn.preprocessing import OneHotEncoder import category_encoders as ce from scipy import sparse from sklearn.linear_model import LogisticRegression as LR from sklearn.preprocessing import StandardScaler from scipy.sparse import csr_matrix import pickle cv_only = True save_cv = True full_train = True def evalerror(preds, dtrain): labels = dtrain.get_label() return 'gini', Gini(labels, preds) NFOLDS = 5 kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218) train = pd.read_csv("../input/train.csv") train_label = train['target'] train_id = train['id'] del train['target'], train['id'] test = pd.read_csv("../input/test.csv") test_id = test['id'] del test['id'] cat_fea = [x for x in list(train) if 'cat' in x] bin_fea = [x for x in list(train) if 'bin' in x] train['missing'] = (train==-1).sum(axis=1).astype(float) test['missing'] = (test==-1).sum(axis=1).astype(float) # include interactions for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)): train, test = interaction_features(train, test, x, y, e) num_features = [c for c in 
list(train) if ('cat' not in c and 'calc' not in c)] num_features.append('missing') inter_fea = [x for x in list(train) if 'inter' in x] #train['cat_sum'] = train[cat_fea].sum(axis=1) #test['cat_sum'] = test[cat_fea].sum(axis=1) #X = train.as_matrix() #X_test = test.as_matrix() #print(X.shape, X_test.shape) #ohe ohe = OneHotEncoder(sparse=True) train_cat = train[cat_fea].as_matrix() train_num = train[[x for x in list(train) if x in num_features]] test_cat = test[cat_fea].as_matrix() test_num = test[[x for x in list(train) if x in num_features]] train_cat[train_cat < 0] = 99 test_cat[test_cat < 0] = 99 traintest = np.vstack((train_cat, test_cat)) traintest = pd.DataFrame(traintest, columns=cat_fea) print(traintest.shape) #encoder = ce.HelmertEncoder(cols=cat_fea) #encoder.fit(traintest) #train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea)) #test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea)) ohe.fit(traintest) train_ohe = ohe.transform(train_cat) test_ohe = ohe.transform(test_cat) del traintest train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk")) cat_count_features = [] for c in cat_fea: d = pd.concat([train[c],test[c]]).value_counts().to_dict() train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0)) test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0)) cat_count_features.append('%s_count'%c) train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))] test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))] #proj for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']: for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']: if t != g: s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g) train_list.append(s_train) 
test_list.append(s_test) X = sparse.hstack(train_list).tocsr() X_test = sparse.hstack(test_list).tocsr() #X = train_num #X_test = test_num all_data = np.vstack([X.toarray(), X_test.toarray()]) #all_data = np.vstack([X, X_test]) scaler = StandardScaler() scaler.fit(all_data) X = scaler.transform(X.toarray()) X_test = scaler.transform(X_test.toarray()) #X = scaler.transform(X) #X_test = scaler.transform(X_test) print(X.shape, X_test.shape) cv_train = np.zeros(len(train_label)) cv_pred = np.zeros(len(test_id)) def nn_model(): model = Sequential() model.add(Dense(512, input_dim=X.shape[1], init='he_normal')) model.add(PReLU()) model.add(BatchNormalization()) model.add(Dropout(0.9)) model.add(Dense(64, init='he_normal')) model.add(PReLU()) model.add(BatchNormalization()) model.add(Dropout(0.8)) model.add(Dense(1, init='he_normal', activation='sigmoid')) model.compile(loss='binary_crossentropy', optimizer='adam') return (model) num_seeds = 5 begintime = time() if cv_only: for s in xrange(num_seeds): np.random.seed(s) for (inTr, inTe) in kfold.split(X, train_label): xtr = X[inTr] ytr = train_label[inTr] xte = X[inTe] yte = train_label[inTe] model = nn_model() model.fit(xtr, ytr, epochs=35, batch_size=512, verbose=2, validation_data=[xte, yte]) cv_train[inTe] += model.predict_proba(x=xte, batch_size=512, verbose=0)[:, 0] cv_pred += model.predict_proba(x=X_test, batch_size=512, verbose=0)[:, 0] print(s) print(Gini(train_label, cv_train / (1. * (s + 1)))) print(str(datetime.timedelta(seconds=time() - begintime))) if save_cv: pd.DataFrame({'id': test_id, 'target': cv_pred * 1./ (NFOLDS * num_seeds)}).to_csv('../model/keras3_pred.csv', index=False) pd.DataFrame({'id': train_id, 'target': cv_train * 1. 
                      / num_seeds}).to_csv('../model/keras3_cv.csv', index=False)

================================================
FILE: code_for_exact_solution/keras6.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
from keras.models import Model
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle

cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test =
pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
path = "../input/"
num_features_comb = []
for p in os.listdir(path):
    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
        print(p)
        x,xt = pd.read_pickle(path+p)
        train[p] = x
        test[p] = xt
        num_features_comb.append(p)
num_features += num_features_comb
feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        train['new_ind'] = train[c].astype(str)
        count += 1
    else:
        train['new_ind'] += '_' + train[c].astype(str)
print(train['new_ind'].nunique())
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        test['new_ind'] = test[c].astype(str)
        count += 1
    else:
        test['new_ind'] += '_' + test[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        train['new_reg'] = train[c].astype(str)
        count += 1
    else:
        train['new_reg'] += '_' + train[c].astype(str)
print(train['new_reg'].nunique())
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        test['new_reg'] = test[c].astype(str)
        count += 1
    else:
        test['new_reg'] += '_' + test[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count
= 0
for c in car_features:
    if count == 0:
        train['new_car'] = train[c].astype(str)
        count += 1
    else:
        train['new_car'] += '_' + train[c].astype(str)
print(train['new_car'].nunique())
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
    if count == 0:
        test['new_car'] = test[c].astype(str)
        count += 1
    else:
        test['new_car'] += '_' + test[c].astype(str)
new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
from sklearn.preprocessing import LabelEncoder
train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]
max_cat_values = []
for c in cat_fea:
    le = LabelEncoder()
    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
    train_cat[c] = le.transform(train_cat[c])
    test_cat[c] = le.transform(test_cat[c])
    # NOTE: np.max(x) is the largest encoded label; Keras documents Embedding's
    # input_dim as "maximum integer index + 1", so this value may be one short.
    max_cat_values.append(np.max(x))
#train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test =
                proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))

# NOTE: this first nn_model() is superseded by the redefinition further down.
def nn_model():
    inputs = []
    flatten_layers = []
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        num_c = max_cat_values[e]
        # the embedding is applied to input_c (the original applied it to the
        # not-yet-defined embed_c)
        embed_c = Embedding(
            num_c,
            64,
            input_length=1
        )(input_c)
        embed_c = Dropout(0.25)(embed_c)
        flatten_c = Flatten()(embed_c)
        inputs.append(input_c)
        flatten_layers.append(flatten_c)
    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)
    flatten = merge(flatten_layers, mode='concat')
    # 'init' matches the keyword used by the other models in this repository
    fc1 = Dense(512, init='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)
    fc1 = Dense(64, init='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)
    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)
    model = Model(input = inputs, output = outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)

X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
num_seeds = 5

def nn_model():
    inputs = []
    flatten_layers = []
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        num_c = max_cat_values[e]
        embed_c = Embedding(
            num_c,
            6,
            input_length=1
        )(input_c)
        embed_c = Dropout(0.25)(embed_c)
        flatten_c =
            Flatten()(embed_c)
        inputs.append(input_c)
        flatten_layers.append(flatten_c)
    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)
    flatten = merge(flatten_layers, mode='concat')
    fc1 = Dense(512, init='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.75)(fc1)
    fc1 = Dense(64, init='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.5)(fc1)
    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)
    model = Model(input = inputs, output = outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)

num_seeds = 5
begintime = time()
if cv_only:
    for s in xrange(num_seeds):
        np.random.seed(s)
        for (inTr, inTe) in kfold.split(X, train_label):
            xtr = X[inTr]
            ytr = train_label[inTr]
            xte = X[inTe]
            yte = train_label[inTe]
            xtr_cat = X_cat[inTr]
            xte_cat = X_cat[inTe]
            # get xtr xte cat
            xtr_cat_list, xte_cat_list = [], []
            for i in xrange(xtr_cat.shape[1]):
                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))
            xtr_cat_list.append(xtr)
            xte_cat_list.append(xte)
            model = nn_model()
            def get_rank(x):
                return pd.Series(x).rank(pct=True).values
            model.fit(xtr_cat_list, ytr, epochs=20, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
            print(Gini(train_label[inTe], cv_train[inTe]))
            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
        print(s)
        print(Gini(train_label, cv_train / (1. * (s + 1))))
        print(str(datetime.timedelta(seconds=time() - begintime)))
    if save_cv:
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras6_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1.
                      / num_seeds)}).to_csv('../model/keras6_cv.csv', index=False)

================================================
FILE: code_for_exact_solution/keras7.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, Flatten, concatenate, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle
from keras.models import Model
'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle

cv_only = True
save_cv = True
full_train = True

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test =
pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)
num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]
#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)
path = "../input/"
num_features_comb = []
for p in os.listdir(path):
    if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p:
        print(p)
        x,xt = pd.read_pickle(path+p)
        train[p] = x
        test[p] = xt
        num_features_comb.append(p)
num_features += num_features_comb
feature_names = list(train)
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        train['new_ind'] = train[c].astype(str)
        count += 1
    else:
        train['new_ind'] += '_' + train[c].astype(str)
print(train['new_ind'].nunique())
ind_features = [c for c in feature_names if 'ind' in c]
count = 0
for c in ind_features:
    if count == 0:
        test['new_ind'] = test[c].astype(str)
        count += 1
    else:
        test['new_ind'] += '_' + test[c].astype(str)
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        train['new_reg'] = train[c].astype(str)
        count += 1
    else:
        train['new_reg'] += '_' + train[c].astype(str)
print(train['new_reg'].nunique())
reg_features = [c for c in feature_names if 'reg' in c]
count = 0
for c in reg_features:
    if count == 0:
        test['new_reg'] = test[c].astype(str)
        count += 1
    else:
        test['new_reg'] += '_' + test[c].astype(str)
car_features = [c for c in feature_names if 'car' in c]
count
= 0
for c in car_features:
    if count == 0:
        train['new_car'] = train[c].astype(str)
        count += 1
    else:
        train['new_car'] += '_' + train[c].astype(str)
print(train['new_car'].nunique())
car_features = [c for c in feature_names if 'car' in c]
count = 0
for c in car_features:
    if count == 0:
        test['new_car'] = test[c].astype(str)
        count += 1
    else:
        test['new_car'] += '_' + test[c].astype(str)
new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl')
train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]]
test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:]
print(train['ps_reg_03'].head(10))
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
from sklearn.preprocessing import LabelEncoder
train_cat = train[cat_fea]
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea]
test_num = test[[x for x in list(train) if x in num_features]]
max_cat_values = []
for c in cat_fea:
    le = LabelEncoder()
    x = le.fit_transform(pd.concat([train_cat, test_cat])[c])
    train_cat[c] = le.transform(train_cat[c])
    test_cat[c] = le.transform(test_cat[c])
    # NOTE: np.max(x) is the largest encoded label; Keras documents Embedding's
    # input_dim as "maximum integer index + 1", so this value may be one short.
    max_cat_values.append(np.max(x))
# open in binary mode so the pickle also loads under Python 3
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk", "rb"))
cat_count_features = []
for c in cat_fea + ['new_ind','new_reg','new_car']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
print(train_num.dtypes)
train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train[cat_count_features], train_fea0]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test[cat_count_features], test_fea0]#, np.ones(shape=(test_num.shape[0], 1))]
#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)
X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
X_cat = train_cat.as_matrix()
X_test_cat = test_cat.as_matrix()
x_test_cat = []
for i in xrange(X_test_cat.shape[1]):
    x_test_cat.append(X_test_cat[:, i].reshape(-1, 1))
x_test_cat.append(X_test)
num_seeds = 5

def nn_model():
    inputs = []
    flatten_layers = []
    for e, c in enumerate(cat_fea):
        input_c = Input(shape=(1, ), dtype='int32')
        num_c = max_cat_values[e]
        embed_c = Embedding(
            num_c,
            3,
            input_length=1
        )(input_c)
        embed_c = Dropout(0.)(embed_c)
        flatten_c = Flatten()(embed_c)
        inputs.append(input_c)
        flatten_layers.append(flatten_c)
    input_num = Input(shape=(X.shape[1],), dtype='float32')
    flatten_layers.append(input_num)
    inputs.append(input_num)
    flatten = merge(flatten_layers, mode='concat')
    fc1 = Dense(512, init='he_normal')(flatten)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)
    fc1 = Dense(64, init='he_normal')(fc1)
    fc1 = PReLU()(fc1)
    fc1 = BatchNormalization()(fc1)
    fc1 = Dropout(0.8)(fc1)
    outputs = Dense(1, init='he_normal', activation='sigmoid')(fc1)
    model = Model(input = inputs, output = outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return (model)

num_seeds = 3
begintime = time()
if cv_only:
    for s in
            xrange(num_seeds):
        np.random.seed(s)
        for (inTr, inTe) in kfold.split(X, train_label):
            xtr = X[inTr]
            ytr = train_label[inTr]
            xte = X[inTe]
            yte = train_label[inTe]
            xtr_cat = X_cat[inTr]
            xte_cat = X_cat[inTe]
            # get xtr xte cat
            xtr_cat_list, xte_cat_list = [], []
            for i in xrange(xtr_cat.shape[1]):
                xtr_cat_list.append(xtr_cat[:, i].reshape(-1, 1))
                xte_cat_list.append(xte_cat[:, i].reshape(-1, 1))
            xtr_cat_list.append(xtr)
            xte_cat_list.append(xte)
            model = nn_model()
            def get_rank(x):
                return pd.Series(x).rank(pct=True).values
            model.fit(xtr_cat_list, ytr, epochs=30, batch_size=512, verbose=2, validation_data=[xte_cat_list, yte])
            cv_train[inTe] += get_rank(model.predict(x=xte_cat_list, batch_size=512, verbose=0)[:, 0])
            print(Gini(train_label[inTe], cv_train[inTe]))
            cv_pred += get_rank(model.predict(x=x_test_cat, batch_size=512, verbose=0)[:, 0])
        print(s)
        print(Gini(train_label, cv_train / (1. * (s + 1))))
        print(str(datetime.timedelta(seconds=time() - begintime)))
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred_0.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1. / num_seeds)}).to_csv('../model/keras7_cv_0.csv', index=False)
    if save_cv:
        pd.DataFrame({'id': test_id, 'target': get_rank(cv_pred * 1./ (NFOLDS * num_seeds))}).to_csv('../model/keras7_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': get_rank(cv_train * 1.
                      / num_seeds)}).to_csv('../model/keras7_cv.csv', index=False)

================================================
FILE: code_for_exact_solution/lightgbm1.py
================================================
'''
simple lightgbm benchmark
'''
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

cv_only = True
save_cv = True
full_train = False

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # third element tells LightGBM that a higher Gini is better
    return 'gini', Gini(labels, preds), True

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']
test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']
cat_list = [x for x in list(train) if 'cat' in x]
print(cat_list)
# count the missing values (-1) per row
train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy
#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)
#ohe
ohe = OneHotEncoder(sparse=False)
train_cat = train[[x for x in list(train) if 'cat' in x]].as_matrix()
train_num = train[[x for x in list(train) if 'cat' not in x]]
test_cat = test[[x for x in list(test) if 'cat' in x]].as_matrix()
test_num = test[[x for x in list(test) if 'cat' not in x]]
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99
train_ohe = ohe.fit_transform(train_cat)
test_ohe = ohe.transform(test_cat)
print("cat_list now:", cat_list)
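# Illustrative sketch, not part of the original solution: the 'num_na' column
# above is built with replace(-1, NaN).isnull().sum(axis=1), while the other
# scripts in this folder use (df == -1).sum(axis=1). Assuming the raw data has
# no NaN of its own, both count the same thing. The helper name
# _count_missing_sketch is hypothetical and is never called by this script.
import pandas as pd

def _count_missing_sketch(df, sentinel=-1):
    # number of cells per row that equal the missing-value sentinel
    return (df == sentinel).sum(axis=1)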
train_cat_count, test_cat_count = cat_count(train, test, cat_list)
print("cat count shape:", train_cat_count.shape, test_cat_count.shape)
X = sparse.hstack([train_num, train_ohe, train_cat_count]).tocsr()
X_test = sparse.hstack([test_num, test_ohe, test_cat_count]).tocsr()
print(X.shape, X_test.shape)
params = {"objective": "binary",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
          "max_bin": 256,
          "min_data_in_leaf": min_data_in_leaf,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "seed": 218,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(16):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))
    params['seed'] = s
    if cv_only:
        kf = kfold.split(X, train_label)
        best_trees = []
        fold_scores = []
        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                             early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            cv_train[validate] += bst.predict(X_validate)
            score = Gini(label_validate, cv_train[validate])
            print score
            fold_scores.append(score)
        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred
        print("cv score:")
        print Gini(train_label, cv_train)
        print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
        print(fold_scores)
        print(best_trees, np.mean(best_trees))
        x_score.append(Gini(train_label, cv_train))
print(x_score)
pd.DataFrame({'id': test_id, 'target':
              final_cv_pred / 16.}).to_csv('../model/lgbm1_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 16.}).to_csv('../model/lgbm1_cv_avg.csv', index=False)

================================================
FILE: code_for_exact_solution/lightgbm5.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

cv_only = True
save_cv = True
full_train = False
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from util import Gini, proj_num_on_cat, interaction_features, cat_count

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True

path = "../input/"
train = pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']
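# Illustrative sketch, not part of the original solution: util.Gini is not
# shown in this file, so the functions below ASSUME the normalized Gini
# commonly used for this competition (cumulative share of positives when rows
# are sorted by prediction, divided by the same quantity for a perfect
# ordering). The names _gini_sketch and _gini_normalized_sketch are
# hypothetical and are never called by this script.
import numpy as np

def _gini_sketch(actual, pred):
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    n = len(actual)
    # sort by prediction, highest first; ties keep their original order
    order = np.lexsort((np.arange(n), -pred))
    cum_share = np.cumsum(actual[order]) / actual.sum()
    return cum_share.sum() / n - (n + 1) / (2.0 * n)

def _gini_normalized_sketch(actual, pred):
    # 1.0 for a perfect ranking, -1.0 for a perfectly reversed one
    return _gini_sketch(actual, pred) / _gini_sketch(actual, actual)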
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
y = train['target'].values
drop_feature = [
    'id',
    'target'
]
X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
for c in cat_features:
    le = LabelEncoder()
    le.fit(train[c])
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])
ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
    if count==0:
        train['new_ind'] = train[c].astype(str)+'_'
        test['new_ind'] = test[c].astype(str)+'_'
        count+=1
    else:
        train['new_ind'] += train[c].astype(str)+'_'
        test['new_ind'] += test[c].astype(str)+'_'
cat_count_features = []
for c in cat_features+['new_ind']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
    score = spearmanr(train['target'],train['%s_count'%c])
    print(c,score)
train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]
# missing binary projections
#missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0]
#for miss_fea in missing_list:
#
#    train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int)
#    test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int)
X = ssp.hstack(train_list).tocsr()
X_test = ssp.hstack(test_list).tocsr()
params = {"objective": "poisson",
          "boosting_type": "gbdt",
          "learning_rate": learning_rate,
          "num_leaves": int(num_leaves),
          "max_bin": 256,
          "feature_fraction": feature_fraction,
          "verbosity": 0,
          "drop_rate": 0.1,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9
          }
x_score = []
final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
for s in xrange(10):
    cv_train = np.zeros(len(train_label))
    cv_pred = np.zeros(len(test_id))
    params['seed'] = s
    if cv_only:
        kf = kfold.split(X, train_label)
        best_trees = []
        fold_scores = []
        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = lgbm.Dataset(X_train, label_train)
            dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                             early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            cv_train[validate] += bst.predict(X_validate)
            score = Gini(label_validate, cv_train[validate])
            print score
            fold_scores.append(score)
        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred
        print("cv score:")
        print Gini(train_label, cv_train)
        print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
        print(fold_scores)
        print(best_trees, np.mean(best_trees))
        x_score.append(Gini(train_label, cv_train))
print(x_score)
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
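# Illustrative sketch, not part of the original solution: rank(pct=True), as
# applied to 'target' above, replaces each score by its percentile rank. This
# keeps the ordering of the predictions while putting models trained with
# different objectives on a common scale before they are averaged. The helper
# name _rank_pct_sketch is hypothetical and is never called by this script.
import pandas as pd

def _rank_pct_sketch(scores):
    # percentile rank in (0, 1]; tied scores share their average rank
    return pd.Series(scores).rank(pct=True).values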
pred_result.to_csv('../model/lgbm5_pred_avg.csv', index=False)
cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm5_cv_avg.csv', index=False)
#cv score:
#0.287007087138
#current score: 0.289683837899 16

================================================
FILE: code_for_exact_solution/lightgbm6.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat, cat_count
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

cv_only = True
save_cv = True
full_train = False
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from util import Gini, proj_num_on_cat, interaction_features, cat_count

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds), True

path = "../input/"
train =
pd.read_csv(path+'train.csv')
train_label = train['target']
train_id = train['id']
test = pd.read_csv(path+'test.csv')
test_id = test['id']
NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
learning_rate = 0.1
num_leaves = 15
min_data_in_leaf = 2000
# max_bin = x
feature_fraction = 0.6
num_boost_round = 10000
y = train['target'].values
drop_feature = [
    'id',
    'target'
]
X = train.drop(drop_feature,axis=1)
feature_names = X.columns.tolist()
cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)]
num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)]
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)
num_features.append('missing')
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
for c in cat_features:
    le = LabelEncoder()
    le.fit(train[c])
    train[c] = le.transform(train[c])
    test[c] = le.transform(test[c])
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(train[cat_features])
X_cat = enc.transform(train[cat_features])
X_t_cat = enc.transform(test[cat_features])
ind_features = [c for c in feature_names if 'ind' in c]
count=0
for c in ind_features:
    if count==0:
        train['new_ind'] = train[c].astype(str)+'_'
        test['new_ind'] = test[c].astype(str)+'_'
        count+=1
    else:
        train['new_ind'] += train[c].astype(str)+'_'
        test['new_ind'] += test[c].astype(str)+'_'
cat_count_features = []
for c in cat_features+['new_ind']:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)
    score = spearmanr(train['target'],train['%s_count'%c])
    print(c,score)
train_list = [train[num_features+cat_count_features].values,X_cat,]
test_list = [test[num_features+cat_count_features].values,X_t_cat,]
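# Illustrative sketch, not part of the original solution: the '<col>_count'
# features built above are a count (frequency) encoding, where each category
# is replaced by how often it occurs in train and test combined. The helper
# below shows the same idea with Series.map instead of apply(lambda ...); the
# name _count_encode_sketch is hypothetical and is never called by this script.
import pandas as pd

def _count_encode_sketch(train_col, test_col):
    # occurrences of every category over train + test
    counts = pd.concat([train_col, test_col]).value_counts()
    # categories never seen map to NaN, which becomes a count of 0
    train_enc = train_col.map(counts).fillna(0).astype(int)
    test_enc = test_col.map(counts).fillna(0).astype(int)
    return train_enc, test_enc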
# missing binary projections #missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0] #for miss_fea in missing_list: # train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int) # test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int) X = ssp.hstack(train_list).tocsr() X_test = ssp.hstack(test_list).tocsr() params = {"objective": "fair", "boosting_type": "gbdt", "learning_rate": learning_rate, "num_leaves": int(num_leaves), "max_bin": 256, "feature_fraction": feature_fraction, "verbosity": 0, "drop_rate": 0.1, "is_unbalance": False, "max_drop": 50, "min_child_samples": 10, "min_child_weight": 150, "min_split_gain": 0, "subsample": 0.9 } x_score = [] final_cv_train = np.zeros(len(train_label)) final_cv_pred = np.zeros(len(test_id)) for s in xrange(10): cv_train = np.zeros(len(train_label)) cv_pred = np.zeros(len(test_id)) params['seed'] = s if cv_only: kf = kfold.split(X, train_label) best_trees = [] fold_scores = [] for i, (train_fold, validate) in enumerate(kf): X_train, X_validate, label_train, label_validate = \ X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate] dtrain = lgbm.Dataset(X_train, label_train) dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain) bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100, early_stopping_rounds=100) best_trees.append(bst.best_iteration) cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration) cv_train[validate] += bst.predict(X_validate) score = Gini(label_validate, cv_train[validate]) print score fold_scores.append(score) cv_pred /= NFOLDS final_cv_train += cv_train final_cv_pred += cv_pred print("cv score:") print Gini(train_label, cv_train) print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1 print(fold_scores) print(best_trees, np.mean(best_trees)) x_score.append(Gini(train_label, cv_train)) print(x_score) pred_result = pd.DataFrame({'id': test_id, 
                            'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm6_pred_avg.csv', index=False)

cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm6_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16

================================================
FILE: code_for_exact_solution/lightgbm7.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from util import proj_num_on_cat, cat_count

cv_only = True
save_cv = True
full_train = False

import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import category_encoders as ce
from util import Gini, proj_num_on_cat, interaction_features, cat_count


def evalerror(preds, dtrain):
    labels =
dtrain.get_label() return 'gini', Gini(labels, preds), True path = "../input/" train = pd.read_csv(path+'train.csv') train_label = train['target'] train_id = train['id'] test = pd.read_csv(path+'test.csv') test_id = test['id'] NFOLDS = 5 kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218) learning_rate = 0.1 num_leaves = 15 min_data_in_leaf = 2000 # max_bin = x feature_fraction = 0.6 num_boost_round = 10000 y = train['target'].values drop_feature = [ 'id', 'target' ] X = train.drop(drop_feature,axis=1) feature_names = X.columns.tolist() cat_features = [c for c in feature_names if ('cat' in c and 'count' not in c)] num_features = [c for c in feature_names if ('cat' not in c and 'calc' not in c)] train['missing'] = (train==-1).sum(axis=1).astype(float) test['missing'] = (test==-1).sum(axis=1).astype(float) num_features.append('missing') from sklearn.preprocessing import OneHotEncoder,LabelEncoder for c in cat_features: le = LabelEncoder() le.fit(train[c]) train[c] = le.transform(train[c]) test[c] = le.transform(test[c]) from sklearn.preprocessing import normalize from scipy.stats import spearmanr from sklearn.preprocessing import OneHotEncoder enc = OneHotEncoder() enc.fit(train[cat_features]) X_cat = enc.transform(train[cat_features]) X_t_cat = enc.transform(test[cat_features]) ind_features = [c for c in feature_names if 'ind' in c] count=0 for c in ind_features: if count==0: train['new_ind'] = train[c].astype(str)+'_' test['new_ind'] = test[c].astype(str)+'_' count+=1 else: train['new_ind'] += train[c].astype(str)+'_' test['new_ind'] += test[c].astype(str)+'_' cat_count_features = [] for c in cat_features+['new_ind']: d = pd.concat([train[c],test[c]]).value_counts().to_dict() train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0)) test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0)) cat_count_features.append('%s_count'%c) score = spearmanr(train['target'],train['%s_count'%c]) print(c,score) train_list = 
[train[num_features+cat_count_features].values,X_cat,] test_list = [test[num_features+cat_count_features].values,X_t_cat,] # missing binary projections #missing_list = [x for x in list(train) if np.sum(train[x] == -1) > 0] #for miss_fea in missing_list: # train['{}_miss_code'.format(miss_fea)] = (train[miss_fea] == -1).astype(int) # test['{}_miss_code'.format(miss_fea)] = (test[miss_fea] == -1).astype(int) X = ssp.hstack(train_list).tocsr() X_test = ssp.hstack(test_list).tocsr() params = {"objective": "binary", "boosting_type": "goss", "learning_rate": learning_rate, "num_leaves": int(num_leaves), "max_bin": 256, "feature_fraction": feature_fraction, "verbosity": 0, "drop_rate": 0.1, "is_unbalance": False, "max_drop": 50, "min_child_samples": 10, "min_child_weight": 150, "min_split_gain": 0, "subsample": 0.9 } x_score = [] final_cv_train = np.zeros(len(train_label)) final_cv_pred = np.zeros(len(test_id)) for s in xrange(10): cv_train = np.zeros(len(train_label)) cv_pred = np.zeros(len(test_id)) params['seed'] = s if cv_only: kf = kfold.split(X, train_label) best_trees = [] fold_scores = [] for i, (train_fold, validate) in enumerate(kf): X_train, X_validate, label_train, label_validate = \ X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate] dtrain = lgbm.Dataset(X_train, label_train) dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain) bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100, early_stopping_rounds=100) best_trees.append(bst.best_iteration) cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration) cv_train[validate] += bst.predict(X_validate) score = Gini(label_validate, cv_train[validate]) print score fold_scores.append(score) cv_pred /= NFOLDS final_cv_train += cv_train final_cv_pred += cv_pred print("cv score:") print Gini(train_label, cv_train) print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1 print(fold_scores) 
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)
pred_result = pd.DataFrame({'id': test_id, 'target': final_cv_pred / 10.})
pred_result['target'] = pred_result['target'].rank(pct=True)
pred_result.to_csv('../model/lgbm7_pred_avg.csv', index=False)

cv_result = pd.DataFrame({'id': train_id, 'target': final_cv_train / 10.})
cv_result['target'] = cv_result['target'].rank(pct=True)
cv_result.to_csv('../model/lgbm7_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16

================================================
FILE: code_for_exact_solution/lightgbm8.py
================================================
import numpy as np
import scipy as sp
from scipy import sparse as ssp
import pandas as pd
from sklearn.preprocessing import normalize
from scipy.stats import spearmanr
import lightgbm as lgb
import lightgbm as lgbm
from sklearn.model_selection import StratifiedKFold
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
from util import proj_num_on_cat, cat_count

cv_only = True
save_cv = True
full_train = False

import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import category_encoders as ce
from
scipy import sparse from util import Gini, proj_num_on_cat, interaction_features, cat_count NFOLDS = 5 kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218) learning_rate = 0.1 num_leaves = 15 min_data_in_leaf = 2000 # max_bin = x feature_fraction = 0.6 num_boost_round = 10000 cv_only = True save_cv = True full_train = True def evalerror(preds, dtrain): labels = dtrain.get_label() return 'gini', Gini(labels, preds), True NFOLDS = 5 kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218) train = pd.read_csv("../input/train.csv") train_label = train['target'] train_id = train['id'] del train['target'], train['id'] test = pd.read_csv("../input/test.csv") test_id = test['id'] del test['id'] cat_fea = [x for x in list(train) if 'cat' in x] bin_fea = [x for x in list(train) if 'bin' in x] train['missing'] = (train==-1).sum(axis=1).astype(float) test['missing'] = (test==-1).sum(axis=1).astype(float) # include interactions for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)): train, test = interaction_features(train, test, x, y, e) num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)] num_features.append('missing') inter_fea = [x for x in list(train) if 'inter' in x] #train['cat_sum'] = train[cat_fea].sum(axis=1) #test['cat_sum'] = test[cat_fea].sum(axis=1) path = "../input/" num_features_comb = [] import os for p in os.listdir(path): if 'ps_reg_02___ps_car_07_cat' in p or 'ps_reg_01___ps_car_13___ps_car_15' in p: print(p) x,xt = pd.read_pickle(path+p) train[p] = x test[p] = xt num_features_comb.append(p) num_features += num_features_comb feature_names = list(train) ind_features = [c for c in feature_names if 'ind' in c] count = 0 for c in ind_features: if count == 0: train['new_ind'] = train[c].astype(str) count += 1 else: train['new_ind'] += '_' + train[c].astype(str) print(train['new_ind'].nunique()) ind_features = [c for c in 
feature_names if 'ind' in c] count = 0 for c in ind_features: if count == 0: test['new_ind'] = test[c].astype(str) count += 1 else: test['new_ind'] += '_' + test[c].astype(str) reg_features = [c for c in feature_names if 'reg' in c] count = 0 for c in reg_features: if count == 0: train['new_reg'] = train[c].astype(str) count += 1 else: train['new_reg'] += '_' + train[c].astype(str) print(train['new_reg'].nunique()) reg_features = [c for c in feature_names if 'reg' in c] count = 0 for c in reg_features: if count == 0: test['new_reg'] = test[c].astype(str) count += 1 else: test['new_reg'] += '_' + test[c].astype(str) car_features = [c for c in feature_names if 'car' in c] count = 0 for c in car_features: if count == 0: train['new_car'] = train[c].astype(str) count += 1 else: train['new_car'] += '_' + train[c].astype(str) print(train['new_car'].nunique()) car_features = [c for c in feature_names if 'car' in c] count = 0 for c in car_features: if count == 0: test['new_car'] = test[c].astype(str) count += 1 else: test['new_car'] += '_' + test[c].astype(str) new_ps_reg_03 = pd.read_pickle(path + 'new_ps_reg_03.pkl') train['ps_reg_03'] = new_ps_reg_03[:train.shape[0]] test['ps_reg_03'] = new_ps_reg_03[train.shape[0]:] print(train['ps_reg_03'].head(10)) #X = train.as_matrix() #X_test = test.as_matrix() #print(X.shape, X_test.shape) #ohe ohe = OneHotEncoder(sparse=True) train_cat = train[cat_fea].as_matrix() train_num = train[[x for x in list(train) if x in num_features]] test_cat = test[cat_fea].as_matrix() test_num = test[[x for x in list(train) if x in num_features]] train_cat[train_cat < 0] = 99 test_cat[test_cat < 0] = 99 traintest = np.vstack((train_cat, test_cat)) traintest = pd.DataFrame(traintest, columns=cat_fea) print(traintest.shape) #encoder = ce.HelmertEncoder(cols=cat_fea) #encoder.fit(traintest) #train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea)) #test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea)) ohe.fit(traintest) 
train_ohe = ohe.transform(train_cat) test_ohe = ohe.transform(test_cat) del traintest cat_count_features = [] for c in cat_fea + ['new_ind','new_reg','new_car']: d = pd.concat([train[c],test[c]]).value_counts().to_dict() train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0)) test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0)) cat_count_features.append('%s_count'%c) print(train_num.dtypes) train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))] test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))] #proj for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']: for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']: if t != g: s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g) train_list.append(s_train) test_list.append(s_test) X = sparse.hstack(train_list).tocsr() X_test = sparse.hstack(test_list).tocsr() params = {"objective": "binary", "boosting_type": "gbdt", "learning_rate": learning_rate, "num_leaves": int(num_leaves), "max_bin": 256, "feature_fraction": feature_fraction, "verbosity": 0, "drop_rate": 0.1, "is_unbalance": False, "max_drop": 50, "min_child_samples": 10, "min_child_weight": 150, "min_split_gain": 0, "subsample": 0.9 } x_score = [] final_cv_train = np.zeros(len(train_label)) final_cv_pred = np.zeros(len(test_id)) for s in xrange(8): cv_train = np.zeros(len(train_label)) cv_pred = np.zeros(len(test_id)) params['seed'] = s if cv_only: kf = kfold.split(X, train_label) best_trees = [] fold_scores = [] for i, (train_fold, validate) in enumerate(kf): X_train, X_validate, label_train, label_validate = \ X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate] dtrain = lgbm.Dataset(X_train, label_train) dvalid = lgbm.Dataset(X_validate, 
                                 label_validate, reference=dtrain)
            bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dvalid, feval=evalerror, verbose_eval=100,
                             early_stopping_rounds=100)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration)
            cv_train[validate] += bst.predict(X_validate)

            score = Gini(label_validate, cv_train[validate])
            print score
            fold_scores.append(score)

        cv_pred /= NFOLDS
        final_cv_train += cv_train
        final_cv_pred += cv_pred

        print("cv score:")
        print Gini(train_label, cv_train)
        print "current score:", Gini(train_label, final_cv_train / (s + 1.)), s+1
        print(fold_scores)
        print(best_trees, np.mean(best_trees))

        x_score.append(Gini(train_label, cv_train))

print(x_score)

# the seed loop above runs 8 times, so average the accumulated predictions over 8 seeds
pd.DataFrame({'id': test_id, 'target': final_cv_pred / 8.}).to_csv('../model/lgbm8_pred_avg.csv', index=False)
pd.DataFrame({'id': train_id, 'target': final_cv_train / 8.}).to_csv('../model/lgbm8_cv_avg.csv', index=False)

#cv score:
#0.287007087138
#current score: 0.289683837899 16

================================================
FILE: code_for_exact_solution/logistic1.py
================================================
'''
logistic regression benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle

cv_only = True
save_cv = True
full_train = True


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True,
                        random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

# count the missing values (encoded as -1) in each row
train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train_cat_count, test_cat_count = cat_count(train, test, cat_fea)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

inter_fea = [x for x in list(train) if 'inter' in x]

#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)

#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)

#ohe
ohe = OneHotEncoder(sparse=True)
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x not in cat_fea]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x not in cat_fea]]

# map missing categories (-1) to a non-negative code before one-hot encoding
train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

# pickle expects a binary file handle; text mode fails under Python 3
train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk", "rb"))

train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train_fea0, train_cat_count]#,
#             np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test_fea0, test_cat_count]#, np.ones(shape=(test_num.shape[0], 1))]

#proj
for t in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']:
    for g in ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat']:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)

X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
#all_data = np.vstack([X, X_test])

# keep the top 75% of features by ANOVA F-score (keyword form works across scikit-learn versions)
selector = SelectPercentile(f_classif, percentile=75)
selector.fit(X.toarray(), train_label)
X, X_test = selector.transform(X.toarray()), selector.transform(X_test.toarray())

# standardize using statistics computed on train and test together
all_data = np.vstack([X, X_test])
scaler = StandardScaler()
scaler.fit(all_data)
X = csr_matrix(scaler.transform(X))
X_test = csr_matrix(scaler.transform(X_test))
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)

kf = kfold.split(X, train_label)
cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))
best_trees = []
fold_scores = []

if cv_only:
    for i, (train_fold, validate) in enumerate(kf):
        X_train, X_validate, label_train, label_validate = \
            X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
        #selector = SelectPercentile(f_classif, X_train.toarray(), label_train)
        #X_train, X_validate = csr_matrix(selector.transform(X_train.toarray())), csr_matrix(selector.transform(
        #    X_validate.toarray()
        #))
        clf = LR(C=25.)
        clf.fit(X_train, label_train)
        cv_pred += clf.predict_proba(X_test)[:, 1]
        cv_train[validate] += clf.predict_proba(X_validate)[:, 1]
        score = Gini(label_validate, cv_train[validate])
        print score
        fold_scores.append(score)

    print("cv score:")
    print Gini(train_label, cv_train)
    print(fold_scores)
    if save_cv:
        pd.DataFrame({'id': test_id, 'target': cv_pred/NFOLDS}).to_csv('../model/logistic1_pred.csv', index=False)
        pd.DataFrame({'id': train_id, 'target': cv_train}).to_csv('../model/logistic1_cv.csv', index=False)
#cv score:
#0.27906438918
#[0.28773918373071583, 0.26723600995806762, 0.28158062789785737, 0.27506435916773242, 0.28394519465855006]

if full_train:
    clf = LR(C=25.)
    clf.fit(X, train_label)
    pred = clf.predict_proba(X_test)[:, 1]
    pd.DataFrame({'id': test_id, 'target': pred}).to_csv('../model/logistic1_full_pred.csv', index=False)

================================================
FILE: code_for_exact_solution/rank_average.py
================================================
import pandas as pd
from util import Gini


def get_rank(x):
    return pd.Series(x).rank(pct=True).values

train = pd.read_csv("../input/train.csv", usecols = ['target'])

keras3_train = pd.read_csv("../model/keras3_cv.csv")
keras5_train = pd.read_csv("../model/keras5_cv.csv")
keras6_train = pd.read_csv("../model/keras6_cv.csv")
keras7_train = pd.read_csv("../model/keras7_cv.csv")
lgbm1_train = pd.read_csv("../model/lgbm1_cv_avg.csv")
lgbm3_train = pd.read_csv("../model/lgbm3_cv_avg.csv")
lgbm8_train = pd.read_csv("../model/lgbm8_cv_avg.csv")
lgbm5_train = pd.read_csv("../model/lgbm5_cv_avg.csv")
lgbm6_train = pd.read_csv("../model/lgbm6_cv_avg.csv")
lgbm7_train = pd.read_csv("../model/lgbm7_cv_avg.csv")
logistic1_train = pd.read_csv("../model/logistic1_cv.csv")
xgb0_train = pd.read_csv("../model/xgb0_cv.csv")

keras3_test = pd.read_csv("../model/keras3_pred.csv")
keras5_test = pd.read_csv("../model/keras5_pred.csv")
keras6_test = pd.read_csv("../model/keras6_pred.csv")
keras7_test =
    pd.read_csv("../model/keras7_pred.csv")
lgbm1_test = pd.read_csv("../model/lgbm1_pred_avg.csv")
lgbm3_test = pd.read_csv("../model/lgbm3_pred_avg.csv")
lgbm8_test = pd.read_csv("../model/lgbm8_pred_avg.csv")
lgbm5_test = pd.read_csv("../model/lgbm5_pred_avg.csv")
lgbm6_test = pd.read_csv("../model/lgbm6_pred_avg.csv")
lgbm7_test = pd.read_csv("../model/lgbm7_pred_avg.csv")
logistic1_test = pd.read_csv("../model/logistic1_pred.csv")
xgb0_test = pd.read_csv("../model/xgb0_pred.csv")

xgblinear_train = pd.read_csv("../model/xgb0l_cv.csv")
xgblinear_test = pd.read_csv("../model/xgb0l_pred.csv")

# weighted average of per-model percentile ranks on the out-of-fold predictions
result = get_rank(keras5_train['target']) * 0.4 + get_rank(lgbm3_train['target']) * 0.5 + \
         get_rank(xgb0_train['target']) * 0.1 + get_rank(lgbm1_train['target']) * (-0.1) + \
         get_rank(keras3_train['target']) * 0.1 + get_rank(logistic1_train['target']) * 0.1 + \
         get_rank(xgblinear_train['target']) * 0.1 + get_rank(lgbm8_train['target']) * 0.25 + \
         get_rank(lgbm5_train['target']) * 0.1 + \
         get_rank(lgbm6_train['target']) * (-0.1) + get_rank(lgbm7_train['target']) * (0.1) + \
         get_rank(keras6_train['target']) * (-0.1) + \
         get_rank(keras7_train['target']) * 0.3

print "cv of final averaged model:", Gini(train['target'], result)

# same weights applied to the test-set predictions
result = get_rank(keras5_test['target']) * 0.4 + get_rank(lgbm3_test['target']) * 0.5 + \
         get_rank(xgb0_test['target']) * 0.1 + get_rank(lgbm1_test['target']) * (-0.1) + \
         get_rank(keras3_test['target']) * 0.1 + get_rank(logistic1_test['target']) * 0.1 + \
         get_rank(xgblinear_test['target']) * 0.1 + get_rank(lgbm8_test['target']) * 0.25 + \
         get_rank(lgbm5_test['target']) * 0.1 + \
         get_rank(lgbm6_test['target']) * (-0.1) + get_rank(lgbm7_test['target']) * (0.1) + \
         get_rank(keras6_test['target']) * (-0.1) + \
         get_rank(keras7_test['target']) * 0.3

pd.DataFrame({'id': keras5_test['id'], 'target': get_rank(result)}).to_csv("../model/all_average.csv", index = False)

================================================
FILE: code_for_exact_solution/util.py
================================================
import numpy as np
import pandas as pd


def Gini(y_true, y_pred):
    # check and get number of samples
    assert y_true.shape == y_pred.shape
    n_samples = y_true.shape[0]

    # sort rows on prediction column
    # (from largest to smallest)
    arr = np.array([y_true, y_pred]).transpose()
    true_order = arr[arr[:, 0].argsort()][::-1, 0]
    pred_order = arr[arr[:, 1].argsort()][::-1, 0]

    # get Lorenz curves
    L_true = np.cumsum(true_order) * 1. / np.sum(true_order)
    L_pred = np.cumsum(pred_order) * 1. / np.sum(pred_order)
    # 1. / n_samples keeps the division in floating point under Python 2 as well
    L_ones = np.linspace(1. / n_samples, 1, n_samples)

    # get Gini coefficients (area between curves)
    G_true = np.sum(L_ones - L_true)
    G_pred = np.sum(L_ones - L_pred)

    # normalize to true Gini coefficient
    return G_pred * 1. / G_true


def cat_count(train_df, test_df, cat_list):
    # note: adds 'row_id' and 'train' columns to the input frames in place
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train'] + cat_list].append(test_df[['row_id','train'] + cat_list])
    # attach, for each categorical column, the frequency of its value over train + test
    for e, cat in enumerate(cat_list):
        grouped = all_df[[cat]].groupby(cat)
        the_size = pd.DataFrame(grouped.size()).reset_index()
        the_size.columns = [cat, '{}_size'.format(cat)]
        all_df = pd.merge(all_df, the_size, how='left')
    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    selected_train.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
    selected_test.drop(['row_id', 'train'] + cat_list, axis=1, inplace=True)
    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test


def proj_num_on_cat(train_df, test_df, target_column, group_column):
    """
    :param train_df: train data frame
    :param test_df: test data frame
    :param target_column: name of numerical feature
    :param group_column: name of
    categorical feature
    """
    train_df['row_id'] = range(train_df.shape[0])
    test_df['row_id'] = range(test_df.shape[0])
    train_df['train'] = 1
    test_df['train'] = 0
    all_df = train_df[['row_id', 'train', target_column, group_column]].append(test_df[['row_id','train', target_column, group_column]])
    grouped = all_df[[target_column, group_column]].groupby(group_column)
    the_size = pd.DataFrame(grouped.size()).reset_index()
    the_size.columns = [group_column, '%s_size' % target_column]
    the_mean = pd.DataFrame(grouped.mean()).reset_index()
    the_mean.columns = [group_column, '%s_mean' % target_column]
    the_std = pd.DataFrame(grouped.std()).reset_index().fillna(0)
    the_std.columns = [group_column, '%s_std' % target_column]
    the_median = pd.DataFrame(grouped.median()).reset_index()
    the_median.columns = [group_column, '%s_median' % target_column]
    the_stats = pd.merge(the_size, the_mean)
    the_stats = pd.merge(the_stats, the_std)
    the_stats = pd.merge(the_stats, the_median)

    the_max = pd.DataFrame(grouped.max()).reset_index()
    the_max.columns = [group_column, '%s_max' % target_column]
    the_min = pd.DataFrame(grouped.min()).reset_index()
    the_min.columns = [group_column, '%s_min' % target_column]

    the_stats = pd.merge(the_stats, the_max)
    the_stats = pd.merge(the_stats, the_min)

    all_df = pd.merge(all_df, the_stats, how='left')

    selected_train = all_df[all_df['train'] == 1]
    selected_test = all_df[all_df['train'] == 0]
    selected_train.sort_values('row_id', inplace=True)
    selected_test.sort_values('row_id', inplace=True)
    selected_train.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)
    selected_test.drop([target_column, group_column, 'row_id', 'train'], axis=1, inplace=True)

    selected_train, selected_test = np.array(selected_train), np.array(selected_test)
    print(selected_train.shape, selected_test.shape)
    return selected_train, selected_test


def interaction_features(train, test, fea1, fea2, prefix):
    train['inter_{}*'.format(prefix)] = train[fea1] * train[fea2]
    train['inter_{}/'.format(prefix)] = train[fea1] / train[fea2]

    test['inter_{}*'.format(prefix)] = test[fea1] * test[fea2]
    test['inter_{}/'.format(prefix)] = test[fea1] / test[fea2]

    return train, test

================================================
FILE: code_for_exact_solution/xgb0.py
================================================
'''
simple xgboost benchmark
'''
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
import pandas as pd
from util import Gini
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse

cv_only = True
save_cv = True
full_train = False


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
#kfold = KFold(n_splits=NFOLDS, shuffle=True, random_state=0)

eta = 0.05
max_depth = 7
subsample = 0.97
colsample_bytree = 0.85
gamma = 0.05
alpha = 0
min_child_weight = 55
#lamb = 0.35
colsample_bylevel = 0.8
num_boost_round = 10000

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.replace(-1, np.NaN)
test_copy = test_copy.replace(-1, np.NaN)
train['num_na'] = train_copy.isnull().sum(axis=1)
test['num_na'] = test_copy.isnull().sum(axis=1)
del train_copy, test_copy

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)

#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)

#ohe
ohe = OneHotEncoder(sparse=True)
cat_fea = [x for x in list(train) if
           'cat' in x]
train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x not in cat_fea]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x not in cat_fea]]

train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)
#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))
ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

train_list = [train_num, train_ohe]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num, test_ohe]#, np.ones(shape=(test_num.shape[0], 1))]

X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X, X_test = X.toarray(), X_test.toarray()
print(X.shape, X_test.shape)

final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
final_best_trees = []

params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "eta": eta,
          "max_depth": int(max_depth),
          "subsample": subsample,
          "colsample_bytree": colsample_bytree,
          "gamma": gamma,
          #"lamb": lamb,
          "alpha": alpha,
          "min_child_weight": min_child_weight,
          "colsample_bylevel": colsample_bylevel,
          "silent": 1
          }

if cv_only:
    num_seeds = 24
    for s in xrange(num_seeds):
        print(s)
        params['seed'] = s
        kf = kfold.split(X, train_label)

        cv_train = np.zeros(len(train_label))
        cv_pred = np.zeros(len(test_id))

        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain,
                            num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=800,
                            early_stopping_rounds=25, maximize=True)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(xgb.DMatrix(X_test))
            cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
            score = Gini(label_validate, cv_train[validate])
            print score
            fold_scores.append(score)

        final_cv_train += cv_train
        final_cv_pred += cv_pred
        final_best_trees += best_trees

        print("cv score:")
        print Gini(train_label, cv_train)
        print(fold_scores)
        print(best_trees, np.mean(best_trees))
        print("current score:", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)

    final_cv_pred /= (NFOLDS * num_seeds)
    final_cv_train /= num_seeds
    pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb_avg16_pred.csv', index=False)
    pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb_avg16_cv.csv', index=False)
    print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))

## 0.1
#0.281739276885
#[0.28693135981084533, 0.26989064676756958, 0.28035898856108521, 0.28178381987103512, 0.29021910168396381]
#([123, 91, 139, 97, 92], 108.40000000000001)

#0.284350552387
#([1057, 933, 1175, 979, 1168], 1062.4000000000001)

if full_train:
    for s in xrange(32):
        params['seed'] = s
        dtrain = xgb.DMatrix(X, train_label)
        watchlist = [(dtrain, 'train')]
        bst = xgb.train(params, dtrain, 100, evals=watchlist, feval=evalerror, verbose_eval=50, maximize=True)
        pred = bst.predict(xgb.DMatrix(X_test))
        if s == 0:
            final_pred = pred
        else:
            final_pred += pred
    pd.DataFrame({'id': test_id, 'target': final_pred / 32.}).to_csv('../model/xgb_avg_full_pred.csv', index=False)

================================================
FILE: code_for_exact_solution/xgb_linear0.py
================================================
import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection, preprocessing, ensemble
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from time import time
import datetime
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import StandardScaler
from itertools import combinations
import pickle

'''
simple xgboost benchmark
'''
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.feature_selection import SelectPercentile, f_classif
import numpy as np
import pandas as pd
from util import Gini, proj_num_on_cat, interaction_features, cat_count
from sklearn.preprocessing import LabelEncoder
from itertools import combinations
from util import proj_num_on_cat
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from scipy import sparse
from sklearn.linear_model import LogisticRegression as LR
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
import pickle

cv_only = True
save_cv = True
full_train = True


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', Gini(labels, preds)

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)

train = pd.read_csv("../input/train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("../input/test.csv")
test_id = test['id']
del test['id']

cat_fea = [x for x in list(train) if 'cat' in x]
bin_fea = [x for x in list(train) if 'bin' in x]

train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# include interactions
for e, (x, y) in enumerate(combinations(['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01'], 2)):
    train, test = interaction_features(train, test, x, y, e)

num_features = [c for c in list(train) if ('cat' not in c and 'calc' not in c)]
num_features.append('missing')
inter_fea = [x for x in list(train) if 'inter' in x]

#train['cat_sum'] = train[cat_fea].sum(axis=1)
#test['cat_sum'] = test[cat_fea].sum(axis=1)

#X = train.as_matrix()
#X_test = test.as_matrix()
#print(X.shape, X_test.shape)

#ohe
ohe = OneHotEncoder(sparse=True)

train_cat = train[cat_fea].as_matrix()
train_num = train[[x for x in list(train) if x in num_features]]
test_cat = test[cat_fea].as_matrix()
test_num = test[[x for x in list(train) if x in num_features]]

train_cat[train_cat < 0] = 99
test_cat[test_cat < 0] = 99

traintest = np.vstack((train_cat, test_cat))
traintest = pd.DataFrame(traintest, columns=cat_fea)
print(traintest.shape)

#encoder = ce.HelmertEncoder(cols=cat_fea)
#encoder.fit(traintest)
#train_enc = encoder.transform(pd.DataFrame(train_cat, columns=cat_fea))
#test_enc = encoder.transform(pd.DataFrame(test_cat, columns=cat_fea))

ohe.fit(traintest)
train_ohe = ohe.transform(train_cat)
test_ohe = ohe.transform(test_cat)
del traintest

train_fea0, test_fea0 = pickle.load(open("../input/fea0.pk"))
train_fea1, test_fea1 = pickle.load(open("../input/fea0_lgb.pk"))

cat_count_features = []
for c in cat_fea:
    d = pd.concat([train[c],test[c]]).value_counts().to_dict()
    train['%s_count'%c] = train[c].apply(lambda x:d.get(x,0))
    test['%s_count'%c] = test[c].apply(lambda x:d.get(x,0))
    cat_count_features.append('%s_count'%c)

train_list = [train_num.replace([np.inf, -np.inf, np.nan], 0), train_ohe, train[cat_count_features]]#, np.ones(shape=(train_num.shape[0], 1))]
test_list = [test_num.replace([np.inf, -np.inf, np.nan], 0), test_ohe, test[cat_count_features]]#, np.ones(shape=(test_num.shape[0], 1))]

t_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01']
g_fea = ['ps_car_13', 'ps_ind_03', 'ps_reg_03', 'ps_ind_15', 'ps_reg_01', 'ps_ind_01', 'ps_ind_05_cat'] + cat_fea
t_fea = list(set(t_fea))
g_fea = list(set(g_fea))
#proj
for t in t_fea:
    for g in g_fea:
        if t != g:
            s_train, s_test = proj_num_on_cat(train, test, target_column=t, group_column=g)
            train_list.append(s_train)
            test_list.append(s_test)

X = sparse.hstack(train_list).tocsr()
X_test = sparse.hstack(test_list).tocsr()
#X = train_num
#X_test = test_num
all_data = np.vstack([X.toarray(), X_test.toarray()])
#all_data = np.vstack([X, X_test])

scaler = StandardScaler()
scaler.fit(all_data)
X = scaler.transform(X.toarray())
X_test = scaler.transform(X_test.toarray())
#X = scaler.transform(X)
#X_test = scaler.transform(X_test)
print(X.shape, X_test.shape)

final_cv_train = np.zeros(len(train_label))
final_cv_pred = np.zeros(len(test_id))
final_best_trees = []

eta = 0.1
lamb = 0.25
alpha = 1
num_boost_round = 10000

params = {"objective": "binary:logistic",
          "booster": "gbtree",
          "eta": eta,
          "lamb": lamb,
          "alpha": alpha,
          "silent": 1
          }

if cv_only:
    num_seeds = 3
    for s in xrange(num_seeds):
        print(s)
        params['seed'] = s
        kf = kfold.split(X, train_label)
        cv_train = np.zeros(len(train_label))
        cv_pred = np.zeros(len(test_id))
        best_trees = []
        fold_scores = []

        for i, (train_fold, validate) in enumerate(kf):
            X_train, X_validate, label_train, label_validate = \
                X[train_fold, :], X[validate, :], train_label[train_fold], train_label[validate]
            dtrain = xgb.DMatrix(X_train, label_train)
            dvalid = xgb.DMatrix(X_validate, label_validate)
            watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
            bst = xgb.train(params, dtrain, num_boost_round, evals=watchlist, feval=evalerror, verbose_eval=100,
                            early_stopping_rounds=50, maximize=True)
            best_trees.append(bst.best_iteration)
            cv_pred += bst.predict(xgb.DMatrix(X_test))
            cv_train[validate] += bst.predict(xgb.DMatrix(X_validate), ntree_limit=bst.best_ntree_limit)
            score = Gini(label_validate, cv_train[validate])
            print score
            fold_scores.append(score)

        final_cv_train += cv_train
        final_cv_pred += cv_pred
        final_best_trees += best_trees

        print("cv score:")
        print Gini(train_label,
                   cv_train)
        print(fold_scores)
        print(best_trees, np.mean(best_trees))
        print("current score:", Gini(train_label, final_cv_train * 1. / (s + 1)), s+1)

    final_cv_pred /= (NFOLDS * num_seeds)
    final_cv_train /= num_seeds
    pd.DataFrame({'id': test_id, 'target': final_cv_pred}).to_csv('../model/xgb0l_pred.csv', index=False)
    pd.DataFrame({'id': train_id, 'target': final_cv_train}).to_csv('../model/xgb0l_cv.csv', index=False)
    print(np.mean(final_best_trees), np.median(final_best_trees), np.std(final_best_trees))

================================================
FILE: input/readme
================================================
unzipped data goes here

================================================
FILE: model/readme
================================================
models will be saved here

================================================
FILE: readme.md
================================================
### Requirements

*Older or newer versions of the packages below should theoretically work fine.*

- python 2.7
- numpy 1.13.3
- pandas 0.20.3
- sklearn 0.19.1
- keras 2.1.1
- tensorflow 1.4.0
- xgboost 0.6
- lightgbm 2.0.10

### How to reproduce

#### Simple solution (recommended)

Put the unzipped data in `input`.

*Generate a simple solution that is good enough for 2nd place (~0.2938 on private LB)*

`cd code`

`python fea_eng0.py`

`python nn_model290.py` to get an NN model that scores 0.290X

`python gbm_model291.py` to get a GBM model that scores 0.291X

`python simple_average.py`, after which you can find the submission file at `../model/simple_average.csv`

You can reproduce this solution in a few hours.

#### Exact solution (Optional)

Although it is not recommended, you can also reproduce the exact solution we submitted (0.29413 on private LB).

*You can follow the steps below, in addition to the simple solution.*

```
cd ../code_for_exact_solution/
python keras3.py
python keras6.py
python keras7.py
python lightgbm1.py
python lightgbm5.py
python lightgbm6.py
python lightgbm7.py
python lightgbm8.py
python logistic1.py
python xgb0.py
python xgb_linear0.py
python rank_average.py
```

*It can take up to 2 days to generate the exact solution, which improves on the simple one by only 0.0003.*
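
The last step of the exact solution, `rank_average.py`, blends the predictions of the individual models. The block below is a minimal sketch of the general idea of rank averaging, not a copy of the repository's script: the function names `average_ranks` and `rank_average` and the toy predictions are assumptions made for illustration. Because the Gini metric depends only on the ordering of the predictions, converting each model's scores to ranks first puts models with differently scaled outputs on a common footing before they are averaged.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_average(prediction_lists):
    """Blend several models by averaging their ranks, scaled to (0, 1]."""
    n_rows = len(prediction_lists[0])
    blended = [0.0] * n_rows
    for preds in prediction_lists:
        if len(preds) != n_rows:
            raise ValueError("all prediction lists must have the same length")
        for i, rank in enumerate(average_ranks(preds)):
            blended[i] += rank / n_rows
    return [b / len(prediction_lists) for b in blended]


# Hypothetical predictions from two models on three test rows.
# The raw scales differ, but only the ordering within each model matters.
nn_pred = [0.02, 0.10, 0.05]
gbm_pred = [0.30, 0.90, 0.20]
blended = rank_average([nn_pred, gbm_pred])
print(blended)  # row 1 gets the highest blended rank (1.0)
```

In practice each list would be the `target` column of one of the prediction files written to `../model/`, and the blended values would be written back out next to the matching `id` column.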