Showing preview only (210K chars total). Download the full file or copy to clipboard to get everything.
Repository: cerlymarco/shap-hypetune
Branch: main
Commit: 8f46e161d27a
Files: 11
Total size: 203.6 KB
Directory structure:
gitextract_liu6w4f0/
├── .gitignore
├── LICENSE
├── README.md
├── notebooks/
│ ├── LGBM_usage.ipynb
│ └── XGBoost_usage.ipynb
├── requirements.txt
├── setup.py
└── shaphypetune/
├── __init__.py
├── _classes.py
├── shaphypetune.py
└── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.DS_Store
# Created by https://www.gitignore.io/api/python
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
# End of https://www.gitignore.io/api/python
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2021 Marco Cerliani
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# shap-hypetune
A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.

## Overview
Hyperparameters tuning and features selection are two common steps in every machine learning pipeline. Most of the time they are computed separately and independently. This may result in suboptimal performances and in a more time expensive process.
shap-hypetune aims to combine hyperparameters tuning and features selection in a single pipeline optimizing the optimal number of features while searching for the optimal parameters configuration. Hyperparameters Tuning or Features Selection can also be carried out as standalone operations.
**shap-hypetune main features:**
- designed for gradient boosting models, as LGBModel or XGBModel;
- developed to be integrable with the scikit-learn ecosystem;
- effective in both classification or regression tasks;
- customizable training process, supporting early-stopping and all the other fitting options available in the standard algorithms api;
- ranking feature selection algorithms: Recursive Feature Elimination (RFE); Recursive Feature Addition (RFA); or Boruta;
- classical boosting based feature importances or SHAP feature importances (the later can be computed also on the eval_set);
- apply grid-search, random-search, or bayesian-search (from hyperopt);
- parallelized computations with joblib.
## Installation
```shell
pip install --upgrade shap-hypetune
```
lightgbm, xgboost are not needed requirements. The module depends only on NumPy, shap, scikit-learn and hyperopt. Python 3.6 or above is supported.
## Media
- [SHAP for Feature Selection and HyperParameter Tuning](https://towardsdatascience.com/shap-for-feature-selection-and-hyperparameter-tuning-a330ec0ea104)
- [Boruta and SHAP for better Feature Selection](https://towardsdatascience.com/boruta-and-shap-for-better-feature-selection-20ea97595f4a)
- [Recursive Feature Selection: Addition or Elimination?](https://towardsdatascience.com/recursive-feature-selection-addition-or-elimination-755e5d86a791)
- [Boruta SHAP for Temporal Feature Selection](https://towardsdatascience.com/boruta-shap-for-temporal-feature-selection-96a7840c7713)
## Usage
```python
from shaphypetune import BoostSearch, BoostRFE, BoostRFA, BoostBoruta
```
#### Hyperparameters Tuning
```python
BoostSearch(
estimator, # LGBModel or XGBModel
param_grid=None, # parameters to be optimized
greater_is_better=False, # minimize or maximize the monitored score
n_iter=None, # number of sampled parameter configurations
sampling_seed=None, # the seed used for parameter sampling
verbose=1, # verbosity mode
n_jobs=None # number of jobs to run in parallel
)
```
#### Feature Selection (RFE)
```python
BoostRFE(
estimator, # LGBModel or XGBModel
min_features_to_select=None, # the minimum number of features to be selected
step=1, # number of features to remove at each iteration
param_grid=None, # parameters to be optimized
greater_is_better=False, # minimize or maximize the monitored score
importance_type='feature_importances', # which importance measure to use: default or shap
train_importance=True, # where to compute the shap feature importance
n_iter=None, # number of sampled parameter configurations
sampling_seed=None, # the seed used for parameter sampling
verbose=1, # verbosity mode
n_jobs=None # number of jobs to run in parallel
)
```
#### Feature Selection (BORUTA)
```python
BoostBoruta(
estimator, # LGBModel or XGBModel
perc=100, # threshold used to compare shadow and real features
alpha=0.05, # p-value levels for feature rejection
max_iter=100, # maximum Boruta iterations to perform
early_stopping_boruta_rounds=None, # maximum iterations without confirming a feature
param_grid=None, # parameters to be optimized
greater_is_better=False, # minimize or maximize the monitored score
importance_type='feature_importances', # which importance measure to use: default or shap
train_importance=True, # where to compute the shap feature importance
n_iter=None, # number of sampled parameter configurations
sampling_seed=None, # the seed used for parameter sampling
verbose=1, # verbosity mode
n_jobs=None # number of jobs to run in parallel
)
```
#### Feature Selection (RFA)
```python
BoostRFA(
estimator, # LGBModel or XGBModel
min_features_to_select=None, # the minimum number of features to be selected
step=1, # number of features to remove at each iteration
param_grid=None, # parameters to be optimized
greater_is_better=False, # minimize or maximize the monitored score
importance_type='feature_importances', # which importance measure to use: default or shap
train_importance=True, # where to compute the shap feature importance
n_iter=None, # number of sampled parameter configurations
sampling_seed=None, # the seed used for parameter sampling
verbose=1, # verbosity mode
n_jobs=None # number of jobs to run in parallel
)
```
Full examples in the [notebooks folder](https://github.com/cerlymarco/shap-hypetune/tree/main/notebooks).
================================================
FILE: notebooks/LGBM_usage.ipynb
================================================
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification, make_regression\n\nfrom hyperopt import hp\nfrom hyperopt import Trials\n\nfrom lightgbm import *\n\ntry:\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\nexcept:\n !pip install --upgrade shap-hypetune\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\n\nimport warnings\nwarnings.simplefilter('ignore')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:43.363945Z","iopub.execute_input":"2022-01-01T11:46:43.364356Z","iopub.status.idle":"2022-01-01T11:46:45.084134Z","shell.execute_reply.started":"2022-01-01T11:46:43.364243Z","shell.execute_reply":"2022-01-01T11:46:45.083177Z"},"trusted":true},"execution_count":1,"outputs":[{"output_type":"display_data","data":{"text/plain":"<IPython.core.display.HTML object>","text/html":"<style type='text/css'>\n.datatable table.frame { margin-bottom: 0; }\n.datatable table.frame thead { border-bottom: none; }\n.datatable table.frame tr.coltypes td { color: #FFFFFF; line-height: 6px; padding: 0 0.5em;}\n.datatable .bool { background: #DDDD99; }\n.datatable .object { background: #565656; }\n.datatable .int { background: #5D9E5D; }\n.datatable .float { background: #4040CC; }\n.datatable .str { background: #CC4040; }\n.datatable .time { background: #40CC40; }\n.datatable .row_index { background: var(--jp-border-color3); border-right: 1px solid var(--jp-border-color0); color: var(--jp-ui-font-color3); font-size: 9px;}\n.datatable .frame tbody td { text-align: left; }\n.datatable .frame tr.coltypes .row_index { background: var(--jp-border-color0);}\n.datatable th:nth-child(2) { padding-left: 12px; }\n.datatable .hellipsis { color: var(--jp-cell-editor-border-color);}\n.datatable .vellipsis { background: var(--jp-layout-color0); color: var(--jp-cell-editor-border-color);}\n.datatable .na { color: var(--jp-cell-editor-border-color); font-size: 80%;}\n.datatable .sp { opacity: 0.25;}\n.datatable .footer { font-size: 9px; }\n.datatable .frame_dimensions { background: var(--jp-border-color3); border-top: 1px solid var(--jp-border-color0); color: var(--jp-ui-font-color3); display: inline-block; opacity: 0.6; padding: 1px 10px 1px 5px;}\n</style>\n"},"metadata":{}}]},{"cell_type":"code","source":"X_clf, y_clf = make_classification(n_samples=6000, n_features=20, n_classes=2, \n n_informative=4, n_redundant=6, random_state=0)\n\nX_clf_train, X_clf_valid, y_clf_train, y_clf_valid = train_test_split(\n X_clf, y_clf, test_size=0.3, shuffle=False)\n\nX_regr, y_regr = make_classification(n_samples=6000, n_features=20,\n n_informative=7, random_state=0)\n\nX_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(\n X_regr, y_regr, test_size=0.3, shuffle=False)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.086875Z","iopub.execute_input":"2022-01-01T11:46:45.087123Z","iopub.status.idle":"2022-01-01T11:46:45.118700Z","shell.execute_reply.started":"2022-01-01T11:46:45.087094Z","shell.execute_reply":"2022-01-01T11:46:45.117983Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"param_grid = {\n 'learning_rate': [0.2, 0.1],\n 'num_leaves': [25, 35],\n 'max_depth': [10, 12]\n}\n\nparam_dist = {\n 'learning_rate': stats.uniform(0.09, 0.25),\n 'num_leaves': stats.randint(20,40),\n 'max_depth': [10, 12]\n}\n\nparam_dist_hyperopt = {\n 'max_depth': 15 + hp.randint('num_leaves', 5), \n 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n 'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n}\n\n\nregr_lgbm = LGBMRegressor(n_estimators=150, random_state=0, n_jobs=-1)\nclf_lgbm = LGBMClassifier(n_estimators=150, random_state=0, n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.120073Z","iopub.execute_input":"2022-01-01T11:46:45.120376Z","iopub.status.idle":"2022-01-01T11:46:45.132838Z","shell.execute_reply.started":"2022-01-01T11:46:45.120336Z","shell.execute_reply":"2022-01-01T11:46:45.131615Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Hyperparameters Tuning","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH ###\n\nmodel = BoostSearch(clf_lgbm, param_grid=param_grid)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.134450Z","iopub.execute_input":"2022-01-01T11:46:45.135435Z","iopub.status.idle":"2022-01-01T11:46:46.383589Z","shell.execute_reply.started":"2022-01-01T11:46:45.135389Z","shell.execute_reply":"2022-01-01T11:46:46.382860Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.2085\ntrial: 0002 ### iterations: 00019 ### eval_score: 0.21112\ntrial: 0003 ### iterations: 00026 ### eval_score: 0.21162\ntrial: 0004 ### iterations: 00032 ### eval_score: 0.20747\ntrial: 0005 ### iterations: 00054 ### eval_score: 0.20244\ntrial: 0006 ### iterations: 00071 ### eval_score: 0.20052\ntrial: 0007 ### iterations: 00047 ### eval_score: 0.20306\ntrial: 0008 ### iterations: 00050 ### eval_score: 0.20506\n","output_type":"stream"},{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.388010Z","iopub.execute_input":"2022-01-01T11:46:46.389926Z","iopub.status.idle":"2022-01-01T11:46:46.397550Z","shell.execute_reply.started":"2022-01-01T11:46:46.389888Z","shell.execute_reply":"2022-01-01T11:46:46.396658Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=25, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 12},\n 0.20051586840398297)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.398765Z","iopub.execute_input":"2022-01-01T11:46:46.399534Z","iopub.status.idle":"2022-01-01T11:46:46.436761Z","shell.execute_reply.started":"2022-01-01T11:46:46.399498Z","shell.execute_reply":"2022-01-01T11:46:46.431623Z"},"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"(0.9183333333333333, (1800,), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH ###\n\nmodel = BoostSearch(\n regr_lgbm, param_grid=param_dist,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.438241Z","iopub.execute_input":"2022-01-01T11:46:46.438923Z","iopub.status.idle":"2022-01-01T11:46:47.128794Z","shell.execute_reply.started":"2022-01-01T11:46:46.438892Z","shell.execute_reply":"2022-01-01T11:46:47.128107Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.07643\ntrial: 0002 ### iterations: 00052 ### eval_score: 0.06818\ntrial: 0003 ### iterations: 00062 ### eval_score: 0.07042\ntrial: 0004 ### iterations: 00033 ### eval_score: 0.07035\ntrial: 0005 ### iterations: 00032 ### eval_score: 0.07153\ntrial: 0006 ### iterations: 00012 ### eval_score: 0.07547\ntrial: 0007 ### iterations: 00041 ### eval_score: 0.07355\ntrial: 0008 ### iterations: 00025 ### eval_score: 0.07805\n","output_type":"stream"},{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMRegressor(n_estimators=150, random_state=0), n_iter=8,\n param_grid={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd50407f2d0>,\n 'max_depth': [10, 12],\n 'num_leaves': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd50407f590>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.132071Z","iopub.execute_input":"2022-01-01T11:46:47.132611Z","iopub.status.idle":"2022-01-01T11:46:47.142185Z","shell.execute_reply.started":"2022-01-01T11:46:47.132575Z","shell.execute_reply":"2022-01-01T11:46:47.141271Z"},"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.1350674222191923, max_depth=10, n_estimators=150,\n num_leaves=38, random_state=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.06817737242646997)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.143613Z","iopub.execute_input":"2022-01-01T11:46:47.143856Z","iopub.status.idle":"2022-01-01T11:46:47.611056Z","shell.execute_reply.started":"2022-01-01T11:46:47.143827Z","shell.execute_reply":"2022-01-01T11:46:47.610379Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"(0.7272820930747703, (1800,), (1800, 21))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT ###\n\nmodel = BoostSearch(\n regr_lgbm, param_grid=param_dist_hyperopt,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.614530Z","iopub.execute_input":"2022-01-01T11:46:47.616779Z","iopub.status.idle":"2022-01-01T11:46:49.268236Z","shell.execute_reply.started":"2022-01-01T11:46:47.616738Z","shell.execute_reply":"2022-01-01T11:46:49.267608Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.06979\ntrial: 0002 ### iterations: 00055 ### eval_score: 0.07039\ntrial: 0003 ### iterations: 00056 ### eval_score: 0.0716\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.07352\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07936\ntrial: 0006 ### iterations: 00147 ### eval_score: 0.06833\ntrial: 0007 ### iterations: 00032 ### eval_score: 0.07261\ntrial: 0008 ### iterations: 00096 ### eval_score: 0.07074\n","output_type":"stream"},{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMRegressor(n_estimators=150, random_state=0), n_iter=8,\n param_grid={'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x7fd50407fd10>,\n 'learning_rate': <hyperopt.pyll.base.Apply object at 0x7fd50407fa50>,\n 'max_depth': <hyperopt.pyll.base.Apply object at 0x7fd50407f710>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:49.271739Z","iopub.execute_input":"2022-01-01T11:46:49.272301Z","iopub.status.idle":"2022-01-01T11:46:49.279337Z","shell.execute_reply.started":"2022-01-01T11:46:49.272264Z","shell.execute_reply":"2022-01-01T11:46:49.278727Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.7597292534356749,\n learning_rate=0.059836658149176665, max_depth=16,\n n_estimators=150, random_state=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.06832542425080958)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:49.280499Z","iopub.execute_input":"2022-01-01T11:46:49.280735Z","iopub.status.idle":"2022-01-01T11:46:50.260345Z","shell.execute_reply.started":"2022-01-01T11:46:49.280700Z","shell.execute_reply":"2022-01-01T11:46:50.259694Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"(0.7266898674988451, (1800,), (1800, 21))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection","metadata":{}},{"cell_type":"code","source":"### BORUTA ###\n\nmodel = BoostBoruta(clf_lgbm, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:50.263726Z","iopub.execute_input":"2022-01-01T11:46:50.265917Z","iopub.status.idle":"2022-01-01T11:46:56.714012Z","shell.execute_reply.started":"2022-01-01T11:46:50.265869Z","shell.execute_reply":"2022-01-01T11:46:56.713278Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n max_iter=200)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.720017Z","iopub.execute_input":"2022-01-01T11:46:56.720486Z","iopub.status.idle":"2022-01-01T11:46:56.727782Z","shell.execute_reply.started":"2022-01-01T11:46:56.720450Z","shell.execute_reply":"2022-01-01T11:46:56.726815Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(n_estimators=150, random_state=0), 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.730004Z","iopub.execute_input":"2022-01-01T11:46:56.730326Z","iopub.status.idle":"2022-01-01T11:46:56.765852Z","shell.execute_reply.started":"2022-01-01T11:46:56.730286Z","shell.execute_reply":"2022-01-01T11:46:56.760625Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"(0.91, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(regr_lgbm, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.767160Z","iopub.execute_input":"2022-01-01T11:46:56.767432Z","iopub.status.idle":"2022-01-01T11:46:59.411924Z","shell.execute_reply.started":"2022-01-01T11:46:56.767401Z","shell.execute_reply":"2022-01-01T11:46:59.411240Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:59.415300Z","iopub.execute_input":"2022-01-01T11:46:59.417330Z","iopub.status.idle":"2022-01-01T11:46:59.424201Z","shell.execute_reply.started":"2022-01-01T11:46:59.417288Z","shell.execute_reply":"2022-01-01T11:46:59.423561Z"},"trusted":true},"execution_count":17,"outputs":[{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:59.425449Z","iopub.execute_input":"2022-01-01T11:46:59.425674Z","iopub.status.idle":"2022-01-01T11:47:00.248420Z","shell.execute_reply.started":"2022-01-01T11:46:59.425645Z","shell.execute_reply":"2022-01-01T11:47:00.247703Z"},"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"(0.7766363424352807, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(regr_lgbm, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:00.251993Z","iopub.execute_input":"2022-01-01T11:47:00.252510Z","iopub.status.idle":"2022-01-01T11:47:03.954790Z","shell.execute_reply.started":"2022-01-01T11:47:00.252473Z","shell.execute_reply":"2022-01-01T11:47:03.954052Z"},"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:03.958397Z","iopub.execute_input":"2022-01-01T11:47:03.958982Z","iopub.status.idle":"2022-01-01T11:47:03.967715Z","shell.execute_reply.started":"2022-01-01T11:47:03.958931Z","shell.execute_reply":"2022-01-01T11:47:03.966909Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:03.969215Z","iopub.execute_input":"2022-01-01T11:47:03.969612Z","iopub.status.idle":"2022-01-01T11:47:04.838820Z","shell.execute_reply.started":"2022-01-01T11:47:03.969569Z","shell.execute_reply":"2022-01-01T11:47:04.838192Z"},"trusted":true},"execution_count":21,"outputs":[{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"(0.7723191919698336, (1800,), (1800, 8), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_lgbm, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:04.842289Z","iopub.execute_input":"2022-01-01T11:47:04.844564Z","iopub.status.idle":"2022-01-01T11:47:17.780389Z","shell.execute_reply.started":"2022-01-01T11:47:04.844522Z","shell.execute_reply":"2022-01-01T11:47:17.779726Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n importance_type='shap_importances', max_iter=200,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.781535Z","iopub.execute_input":"2022-01-01T11:47:17.784569Z","iopub.status.idle":"2022-01-01T11:47:17.791371Z","shell.execute_reply.started":"2022-01-01T11:47:17.784530Z","shell.execute_reply":"2022-01-01T11:47:17.790591Z"},"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(n_estimators=150, random_state=0), 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.794450Z","iopub.execute_input":"2022-01-01T11:47:17.794986Z","iopub.status.idle":"2022-01-01T11:47:17.813842Z","shell.execute_reply.started":"2022-01-01T11:47:17.794933Z","shell.execute_reply":"2022-01-01T11:47:17.813126Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"(0.9111111111111111, (1800,), (1800, 9), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_lgbm, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.817477Z","iopub.execute_input":"2022-01-01T11:47:17.819641Z","iopub.status.idle":"2022-01-01T11:47:32.735329Z","shell.execute_reply.started":"2022-01-01T11:47:17.819595Z","shell.execute_reply":"2022-01-01T11:47:32.734687Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:32.736646Z","iopub.execute_input":"2022-01-01T11:47:32.737109Z","iopub.status.idle":"2022-01-01T11:47:32.743398Z","shell.execute_reply.started":"2022-01-01T11:47:32.737074Z","shell.execute_reply":"2022-01-01T11:47:32.742747Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:32.744765Z","iopub.execute_input":"2022-01-01T11:47:32.747374Z","iopub.status.idle":"2022-01-01T11:47:33.570515Z","shell.execute_reply.started":"2022-01-01T11:47:32.747336Z","shell.execute_reply":"2022-01-01T11:47:33.569899Z"},"trusted":true},"execution_count":27,"outputs":[{"execution_count":27,"output_type":"execute_result","data":{"text/plain":"(0.7766363424352807, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_lgbm, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:33.571778Z","iopub.execute_input":"2022-01-01T11:47:33.572261Z","iopub.status.idle":"2022-01-01T11:47:39.941084Z","shell.execute_reply.started":"2022-01-01T11:47:33.572226Z","shell.execute_reply":"2022-01-01T11:47:39.940356Z"},"trusted":true},"execution_count":28,"outputs":[{"execution_count":28,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:39.944497Z","iopub.execute_input":"2022-01-01T11:47:39.946592Z","iopub.status.idle":"2022-01-01T11:47:39.953717Z","shell.execute_reply.started":"2022-01-01T11:47:39.946550Z","shell.execute_reply":"2022-01-01T11:47:39.952924Z"},"trusted":true},"execution_count":29,"outputs":[{"execution_count":29,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:39.954955Z","iopub.execute_input":"2022-01-01T11:47:39.955713Z","iopub.status.idle":"2022-01-01T11:47:40.853749Z","shell.execute_reply.started":"2022-01-01T11:47:39.955669Z","shell.execute_reply":"2022-01-01T11:47:40.853100Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(0.7699366468805918, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA ###\n\nmodel = BoostBoruta(clf_lgbm, param_grid=param_grid, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:40.857000Z","iopub.execute_input":"2022-01-01T11:47:40.859123Z","iopub.status.idle":"2022-01-01T11:48:08.045782Z","shell.execute_reply.started":"2022-01-01T11:47:40.859074Z","shell.execute_reply":"2022-01-01T11:48:08.043191Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.19868\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.19844\ntrial: 0003 ### iterations: 00023 ### eval_score: 0.19695\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19949\ntrial: 0005 ### iterations: 00067 ### eval_score: 0.19583\ntrial: 0006 ### iterations: 00051 ### eval_score: 0.1949\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19675\ntrial: 0008 ### iterations: 00055 ### eval_score: 0.19906\n","output_type":"stream"},{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.047190Z","iopub.execute_input":"2022-01-01T11:48:08.048047Z","iopub.status.idle":"2022-01-01T11:48:08.056353Z","shell.execute_reply.started":"2022-01-01T11:48:08.048000Z","shell.execute_reply":"2022-01-01T11:48:08.055615Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=25, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 12},\n 0.19489866976777023,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.058015Z","iopub.execute_input":"2022-01-01T11:48:08.058593Z","iopub.status.idle":"2022-01-01T11:48:08.109632Z","shell.execute_reply.started":"2022-01-01T11:48:08.058410Z","shell.execute_reply":"2022-01-01T11:48:08.108670Z"},"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"(0.915, (1800,), (1800, 9), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(\n regr_lgbm, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.114460Z","iopub.execute_input":"2022-01-01T11:48:08.116626Z","iopub.status.idle":"2022-01-01T11:48:20.506235Z","shell.execute_reply.started":"2022-01-01T11:48:08.116579Z","shell.execute_reply":"2022-01-01T11:48:20.505511Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00107 ### eval_score: 0.06016\ntrial: 0002 ### iterations: 00095 ### eval_score: 0.05711\ntrial: 0003 ### iterations: 00121 ### eval_score: 0.05926\ntrial: 0004 ### iterations: 00103 ### eval_score: 0.05688\ntrial: 0005 ### iterations: 00119 ### eval_score: 0.05618\ntrial: 0006 ### iterations: 00049 ### eval_score: 0.06188\ntrial: 0007 ### iterations: 00150 ### eval_score: 0.05538\ntrial: 0008 ### iterations: 00083 ### eval_score: 0.06084\n","output_type":"stream"},{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd50407f2d0>,\n 'max_depth': [10, 12],\n 'num_leaves': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd50407f590>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:20.509788Z","iopub.execute_input":"2022-01-01T11:48:20.511633Z","iopub.status.idle":"2022-01-01T11:48:20.521139Z","shell.execute_reply.started":"2022-01-01T11:48:20.511592Z","shell.execute_reply":"2022-01-01T11:48:20.520293Z"},"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.13639381870463482, max_depth=12, n_estimators=150,\n num_leaves=25, random_state=0),\n {'learning_rate': 0.13639381870463482, 'num_leaves': 25, 'max_depth': 12},\n 0.0553821617278472,\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:20.522443Z","iopub.execute_input":"2022-01-01T11:48:20.522817Z","iopub.status.idle":"2022-01-01T11:48:21.145683Z","shell.execute_reply.started":"2022-01-01T11:48:20.522785Z","shell.execute_reply":"2022-01-01T11:48:21.145033Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"(0.7784645155736596, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(\n regr_lgbm, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:21.149492Z","iopub.execute_input":"2022-01-01T11:48:21.151302Z","iopub.status.idle":"2022-01-01T11:48:56.679453Z","shell.execute_reply.started":"2022-01-01T11:48:21.151261Z","shell.execute_reply":"2022-01-01T11:48:56.678720Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00150 ### eval_score: 0.06507\ntrial: 0002 ### iterations: 00075 ### eval_score: 0.05784\ntrial: 0003 ### iterations: 00095 ### eval_score: 0.06088\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.06976\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07593\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.05995\ntrial: 0007 ### iterations: 00058 ### eval_score: 0.05916\ntrial: 0008 ### iterations: 00150 ### eval_score: 0.06366\n","output_type":"stream"},{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x7fd50407fd10>,\n 'learning_rate': <hyperopt.pyll.base.Apply object at 0x7fd50407fa50>,\n 'max_depth': <hyperopt.pyll.base.Apply object at 0x7fd50407f710>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:56.682847Z","iopub.execute_input":"2022-01-01T11:48:56.684405Z","iopub.status.idle":"2022-01-01T11:48:56.691812Z","shell.execute_reply.started":"2022-01-01T11:48:56.684368Z","shell.execute_reply":"2022-01-01T11:48:56.690932Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.8515260655364685,\n learning_rate=0.13520045129619862, max_depth=18, n_estimators=150,\n random_state=0),\n {'colsample_bytree': 0.8515260655364685,\n 'learning_rate': 0.13520045129619862,\n 'max_depth': 18},\n 0.0578369356489881,\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:56.693078Z","iopub.execute_input":"2022-01-01T11:48:56.693305Z","iopub.status.idle":"2022-01-01T11:48:57.115924Z","shell.execute_reply.started":"2022-01-01T11:48:56.693277Z","shell.execute_reply":"2022-01-01T11:48:57.115308Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"(0.7686451168212334, (1800,), (1800, 8), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_lgbm, param_grid=param_grid, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"scrolled":true,"execution":{"iopub.status.busy":"2022-01-01T11:48:57.119397Z","iopub.execute_input":"2022-01-01T11:48:57.120009Z","iopub.status.idle":"2022-01-01T11:50:15.982498Z","shell.execute_reply.started":"2022-01-01T11:48:57.119958Z","shell.execute_reply":"2022-01-01T11:50:15.981774Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00036 ### eval_score: 0.19716\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.19818\ntrial: 0003 ### iterations: 00031 ### eval_score: 0.19881\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19949\ntrial: 0005 ### iterations: 00067 ### eval_score: 0.19583\ntrial: 0006 ### iterations: 00051 ### eval_score: 0.1949\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19675\ntrial: 0008 ### iterations: 00057 ### eval_score: 0.19284\n","output_type":"stream"},{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n importance_type='shap_importances', max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]},\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:15.988196Z","iopub.execute_input":"2022-01-01T11:50:15.988729Z","iopub.status.idle":"2022-01-01T11:50:15.996898Z","shell.execute_reply.started":"2022-01-01T11:50:15.988685Z","shell.execute_reply":"2022-01-01T11:50:15.996175Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=35, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 35, 'max_depth': 12},\n 0.1928371931511303,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:15.998631Z","iopub.execute_input":"2022-01-01T11:50:15.999269Z","iopub.status.idle":"2022-01-01T11:50:16.029050Z","shell.execute_reply.started":"2022-01-01T11:50:15.999228Z","shell.execute_reply":"2022-01-01T11:50:16.028270Z"},"trusted":true},"execution_count":42,"outputs":[{"execution_count":42,"output_type":"execute_result","data":{"text/plain":"(0.9111111111111111, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_lgbm, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:16.030261Z","iopub.execute_input":"2022-01-01T11:50:16.030658Z","iopub.status.idle":"2022-01-01T11:51:19.095150Z","shell.execute_reply.started":"2022-01-01T11:50:16.030625Z","shell.execute_reply":"2022-01-01T11:51:19.094483Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00107 ### eval_score: 0.06016\ntrial: 0002 ### iterations: 00102 ### eval_score: 0.05525\ntrial: 0003 ### iterations: 00150 ### eval_score: 0.05869\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.05863\ntrial: 0005 ### iterations: 00119 ### eval_score: 0.05618\ntrial: 0006 ### iterations: 00049 ### eval_score: 0.06188\ntrial: 0007 ### iterations: 00150 ### eval_score: 0.05538\ntrial: 0008 ### iterations: 00083 ### eval_score: 0.06084\n","output_type":"stream"},{"execution_count":43,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd50407f2d0>,\n 'max_depth': [10, 12],\n 'num_leaves': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd50407f590>},\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.098487Z","iopub.execute_input":"2022-01-01T11:51:19.099062Z","iopub.status.idle":"2022-01-01T11:51:19.108772Z","shell.execute_reply.started":"2022-01-01T11:51:19.099027Z","shell.execute_reply":"2022-01-01T11:51:19.107939Z"},"trusted":true},"execution_count":44,"outputs":[{"execution_count":44,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.1350674222191923, max_depth=10, n_estimators=150,\n num_leaves=38, random_state=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.05524518772497125,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.110141Z","iopub.execute_input":"2022-01-01T11:51:19.110358Z","iopub.status.idle":"2022-01-01T11:51:19.840667Z","shell.execute_reply.started":"2022-01-01T11:51:19.110333Z","shell.execute_reply":"2022-01-01T11:51:19.840035Z"},"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"(0.779012428496056, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_lgbm, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.844245Z","iopub.execute_input":"2022-01-01T11:51:19.844839Z","iopub.status.idle":"2022-01-01T11:52:27.830673Z","shell.execute_reply.started":"2022-01-01T11:51:19.844800Z","shell.execute_reply":"2022-01-01T11:52:27.829915Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00150 ### eval_score: 0.06508\ntrial: 0002 ### iterations: 00091 ### eval_score: 0.05997\ntrial: 0003 ### iterations: 00094 ### eval_score: 0.06078\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.06773\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07565\ntrial: 0006 ### iterations: 00150 ### eval_score: 0.05935\ntrial: 0007 ### iterations: 00083 ### eval_score: 0.06047\ntrial: 0008 ### iterations: 00150 ### eval_score: 0.05966\n","output_type":"stream"},{"execution_count":46,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x7fd50407fd10>,\n 'learning_rate': <hyperopt.pyll.base.Apply object at 0x7fd50407fa50>,\n 'max_depth': <hyperopt.pyll.base.Apply object at 0x7fd50407f710>},\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:27.834402Z","iopub.execute_input":"2022-01-01T11:52:27.835864Z","iopub.status.idle":"2022-01-01T11:52:27.842813Z","shell.execute_reply.started":"2022-01-01T11:52:27.835812Z","shell.execute_reply":"2022-01-01T11:52:27.842095Z"},"trusted":true},"execution_count":47,"outputs":[{"execution_count":47,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.7597292534356749,\n learning_rate=0.059836658149176665, max_depth=16,\n n_estimators=150, random_state=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.059352961644604275,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:27.844085Z","iopub.execute_input":"2022-01-01T11:52:27.844320Z","iopub.status.idle":"2022-01-01T11:52:28.690931Z","shell.execute_reply.started":"2022-01-01T11:52:27.844291Z","shell.execute_reply":"2022-01-01T11:52:28.690302Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"(0.7625808256692885, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"markdown","source":"# CUSTOM EVAL METRIC SUPPORT","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score\n\ndef AUC(y_true, y_hat):\n return 'auc', roc_auc_score(y_true, y_hat), True","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:28.691909Z","iopub.execute_input":"2022-01-01T11:52:28.692560Z","iopub.status.idle":"2022-01-01T11:52:28.696813Z","shell.execute_reply.started":"2022-01-01T11:52:28.692526Z","shell.execute_reply":"2022-01-01T11:52:28.696058Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"model = BoostRFE(\n LGBMClassifier(n_estimators=150, random_state=0, metric=\"custom\"), \n param_grid=param_grid, min_features_to_select=1, step=1,\n greater_is_better=True\n)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0, \n eval_metric=AUC\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:28.700234Z","iopub.execute_input":"2022-01-01T11:52:28.700461Z","iopub.status.idle":"2022-01-01T11:52:49.577997Z","shell.execute_reply.started":"2022-01-01T11:52:28.700433Z","shell.execute_reply":"2022-01-01T11:52:49.577317Z"},"trusted":true},"execution_count":50,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00028 ### eval_score: 0.97581\ntrial: 0002 ### iterations: 00016 ### eval_score: 0.97514\ntrial: 0003 ### iterations: 00015 ### eval_score: 0.97574\ntrial: 0004 ### iterations: 00032 ### eval_score: 0.97549\ntrial: 0005 ### iterations: 00075 ### eval_score: 0.97551\ntrial: 0006 ### iterations: 00041 ### eval_score: 0.97597\ntrial: 0007 ### iterations: 00076 ### eval_score: 0.97592\ntrial: 0008 ### iterations: 00060 ### eval_score: 0.97539\n","output_type":"stream"},{"execution_count":50,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(metric='custom', n_estimators=150,\n random_state=0),\n greater_is_better=True, min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"markdown","source":"# CATEGORICAL FEATURE SUPPORT","metadata":{}},{"cell_type":"code","source":"categorical_feature = [0,1,2]\n\nX_clf_train[:,categorical_feature] = (X_clf_train[:,categorical_feature]+100).clip(0).astype(int)\nX_clf_valid[:,categorical_feature] = (X_clf_valid[:,categorical_feature]+100).clip(0).astype(int)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:49.581409Z","iopub.execute_input":"2022-01-01T11:52:49.581982Z","iopub.status.idle":"2022-01-01T11:52:49.589315Z","shell.execute_reply.started":"2022-01-01T11:52:49.581931Z","shell.execute_reply":"2022-01-01T11:52:49.588511Z"},"trusted":true},"execution_count":51,"outputs":[]},{"cell_type":"code","source":"### MANUAL PASS categorical_feature WITH NUMPY ARRAYS ###\n\nmodel = BoostRFE(clf_lgbm, param_grid=param_grid, min_features_to_select=1, step=1)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0,\n categorical_feature=categorical_feature\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:49.590366Z","iopub.execute_input":"2022-01-01T11:52:49.590604Z","iopub.status.idle":"2022-01-01T11:53:00.495917Z","shell.execute_reply.started":"2022-01-01T11:52:49.590576Z","shell.execute_reply":"2022-01-01T11:53:00.495224Z"},"trusted":true},"execution_count":52,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00029 ### eval_score: 0.2036\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.2034\ntrial: 0003 ### iterations: 00027 ### eval_score: 0.20617\ntrial: 0004 ### iterations: 00024 ### eval_score: 0.20003\ntrial: 0005 ### iterations: 00060 ### eval_score: 0.20332\ntrial: 0006 ### iterations: 00063 ### eval_score: 0.20329\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.20136\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.19959\n","output_type":"stream"},{"execution_count":52,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"X_clf_train = pd.DataFrame(X_clf_train)\nX_clf_train[categorical_feature] = X_clf_train[categorical_feature].astype('category')\n\nX_clf_valid = pd.DataFrame(X_clf_valid)\nX_clf_valid[categorical_feature] = X_clf_valid[categorical_feature].astype('category')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:53:00.499198Z","iopub.execute_input":"2022-01-01T11:53:00.499858Z","iopub.status.idle":"2022-01-01T11:53:00.527402Z","shell.execute_reply.started":"2022-01-01T11:53:00.499814Z","shell.execute_reply":"2022-01-01T11:53:00.526779Z"},"trusted":true},"execution_count":53,"outputs":[]},{"cell_type":"code","source":"### PASS category COLUMNS IN PANDAS DF ###\n\nmodel = BoostRFE(clf_lgbm, param_grid=param_grid, min_features_to_select=1, step=1)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:53:00.529027Z","iopub.execute_input":"2022-01-01T11:53:00.529320Z","iopub.status.idle":"2022-01-01T11:53:12.422092Z","shell.execute_reply.started":"2022-01-01T11:53:00.529281Z","shell.execute_reply":"2022-01-01T11:53:12.421368Z"},"trusted":true},"execution_count":54,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00029 ### eval_score: 0.2036\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.2034\ntrial: 0003 ### iterations: 00027 ### eval_score: 0.20617\ntrial: 0004 ### iterations: 00024 ### eval_score: 0.20003\ntrial: 0005 ### iterations: 00060 ### eval_score: 0.20332\ntrial: 0006 ### iterations: 00063 ### eval_score: 0.20329\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.20136\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.19959\n","output_type":"stream"},{"execution_count":54,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]}]}
================================================
FILE: notebooks/XGBoost_usage.ipynb
================================================
{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification, make_regression\n\nfrom hyperopt import hp\nfrom hyperopt import Trials\n\nfrom xgboost import *\n\ntry:\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\nexcept:\n !pip install --upgrade shap-hypetune\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\n\nimport warnings\nwarnings.simplefilter('ignore')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:44.031173Z","iopub.execute_input":"2022-01-01T11:49:44.031497Z","iopub.status.idle":"2022-01-01T11:49:45.071830Z","shell.execute_reply.started":"2022-01-01T11:49:44.031410Z","shell.execute_reply":"2022-01-01T11:49:45.070928Z"},"trusted":true},"execution_count":1,"outputs":[]},{"cell_type":"code","source":"X_clf, y_clf = make_classification(n_samples=6000, n_features=20, n_classes=2, \n n_informative=4, n_redundant=6, random_state=0)\n\nX_clf_train, X_clf_valid, y_clf_train, y_clf_valid = train_test_split(\n X_clf, y_clf, test_size=0.3, shuffle=False)\n\nX_regr, y_regr = make_classification(n_samples=6000, n_features=20,\n n_informative=7, random_state=0)\n\nX_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(\n X_regr, y_regr, test_size=0.3, shuffle=False)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.073832Z","iopub.execute_input":"2022-01-01T11:49:45.074046Z","iopub.status.idle":"2022-01-01T11:49:45.098178Z","shell.execute_reply.started":"2022-01-01T11:49:45.074004Z","shell.execute_reply":"2022-01-01T11:49:45.097461Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"param_grid = {\n 'learning_rate': [0.2, 0.1],\n 'num_leaves': [25, 35],\n 'max_depth': [10, 12]\n}\n\nparam_dist = {\n 'learning_rate': stats.uniform(0.09, 0.25),\n 'num_leaves': stats.randint(20,40),\n 'max_depth': [10, 12]\n}\n\nparam_dist_hyperopt = {\n 'max_depth': 15 + hp.randint('num_leaves', 5), \n 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n 'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n}\n\n\nregr_xgb = XGBRegressor(n_estimators=150, random_state=0, verbosity=0, n_jobs=-1)\nclf_xgb = XGBClassifier(n_estimators=150, random_state=0, verbosity=0, n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.099715Z","iopub.execute_input":"2022-01-01T11:49:45.099916Z","iopub.status.idle":"2022-01-01T11:49:45.108765Z","shell.execute_reply.started":"2022-01-01T11:49:45.099890Z","shell.execute_reply":"2022-01-01T11:49:45.107996Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Hyperparameters Tuning","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH ###\n\nmodel = BoostSearch(clf_xgb, param_grid=param_grid)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.109686Z","iopub.execute_input":"2022-01-01T11:49:45.109871Z","iopub.status.idle":"2022-01-01T11:49:52.490942Z","shell.execute_reply.started":"2022-01-01T11:49:45.109848Z","shell.execute_reply":"2022-01-01T11:49:52.490078Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.2045\ntrial: 0002 ### iterations: 00026 ### eval_score: 0.19472\ntrial: 0003 ### iterations: 00021 ### eval_score: 0.2045\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19472\ntrial: 0005 ### iterations: 00045 ### eval_score: 0.19964\ntrial: 0006 ### iterations: 00050 ### eval_score: 0.20157\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19964\ntrial: 0008 ### iterations: 00050 ### eval_score: 0.20157\n","output_type":"stream"},{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.493607Z","iopub.execute_input":"2022-01-01T11:49:52.494126Z","iopub.status.idle":"2022-01-01T11:49:52.504649Z","shell.execute_reply.started":"2022-01-01T11:49:52.494081Z","shell.execute_reply":"2022-01-01T11:49:52.503849Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.2, max_delta_step=0,\n max_depth=12, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.2, 'num_leaves': 25, 'max_depth': 12},\n 0.194719)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.506201Z","iopub.execute_input":"2022-01-01T11:49:52.506365Z","iopub.status.idle":"2022-01-01T11:49:52.528604Z","shell.execute_reply.started":"2022-01-01T11:49:52.506344Z","shell.execute_reply":"2022-01-01T11:49:52.528078Z"},"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"(0.9138888888888889, (1800,), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH ###\n\nmodel = BoostSearch(\n regr_xgb, param_grid=param_dist,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.529476Z","iopub.execute_input":"2022-01-01T11:49:52.530097Z","iopub.status.idle":"2022-01-01T11:50:03.018637Z","shell.execute_reply.started":"2022-01-01T11:49:52.530066Z","shell.execute_reply":"2022-01-01T11:50:03.017927Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00012 ### eval_score: 0.27616\ntrial: 0002 ### iterations: 00056 ### eval_score: 0.26211\ntrial: 0003 ### iterations: 00078 ### eval_score: 0.27603\ntrial: 0004 ### iterations: 00045 ### eval_score: 0.26117\ntrial: 0005 ### iterations: 00046 ### eval_score: 0.27868\ntrial: 0006 ### iterations: 00035 ### eval_score: 0.27815\ntrial: 0007 ### iterations: 00039 ### eval_score: 0.2753\ntrial: 0008 ### iterations: 00016 ### eval_score: 0.28116\n","output_type":"stream"},{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None, colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estim...\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n n_iter=8,\n param_grid={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f5d29426490>,\n 'max_depth': [10, 12],\n 'num_leaves': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f5d29426710>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.019747Z","iopub.execute_input":"2022-01-01T11:50:03.020416Z","iopub.status.idle":"2022-01-01T11:50:03.030730Z","shell.execute_reply.started":"2022-01-01T11:50:03.020379Z","shell.execute_reply":"2022-01-01T11:50:03.030065Z"},"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1669837381562427,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1669837381562427, 'num_leaves': 25, 'max_depth': 10},\n 0.26117)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.032146Z","iopub.execute_input":"2022-01-01T11:50:03.032612Z","iopub.status.idle":"2022-01-01T11:50:03.058721Z","shell.execute_reply.started":"2022-01-01T11:50:03.032572Z","shell.execute_reply":"2022-01-01T11:50:03.058084Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"(0.7271524639165458, (1800,))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT ###\n\nmodel = BoostSearch(\n regr_xgb, param_grid=param_dist_hyperopt,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.059800Z","iopub.execute_input":"2022-01-01T11:50:03.062204Z","iopub.status.idle":"2022-01-01T11:50:32.323625Z","shell.execute_reply.started":"2022-01-01T11:50:03.062158Z","shell.execute_reply":"2022-01-01T11:50:32.322789Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.27498\ntrial: 0002 ### iterations: 00074 ### eval_score: 0.27186\ntrial: 0003 ### iterations: 00038 ### eval_score: 0.28326\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.29455\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.28037\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.26421\ntrial: 0007 ### iterations: 00052 ### eval_score: 0.27191\ntrial: 0008 ### iterations: 00133 ### eval_score: 0.29251\n","output_type":"stream"},{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None, colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estim...\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n n_iter=8,\n param_grid={'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x7f5d29426890>,\n 'learning_rate': <hyperopt.pyll.base.Apply object at 0x7f5d29426a50>,\n 'max_depth': <hyperopt.pyll.base.Apply object at 0x7f5d29426790>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.324994Z","iopub.execute_input":"2022-01-01T11:50:32.325480Z","iopub.status.idle":"2022-01-01T11:50:32.335828Z","shell.execute_reply.started":"2022-01-01T11:50:32.325441Z","shell.execute_reply":"2022-01-01T11:50:32.334970Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.264211)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.337011Z","iopub.execute_input":"2022-01-01T11:50:32.337395Z","iopub.status.idle":"2022-01-01T11:50:32.370381Z","shell.execute_reply.started":"2022-01-01T11:50:32.337369Z","shell.execute_reply":"2022-01-01T11:50:32.369816Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"(0.7207605727361562, (1800,))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection","metadata":{}},{"cell_type":"code","source":"### BORUTA ###\n\nmodel = BoostBoruta(clf_xgb, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.371634Z","iopub.execute_input":"2022-01-01T11:50:32.372109Z","iopub.status.idle":"2022-01-01T11:50:50.797541Z","shell.execute_reply.started":"2022-01-01T11:50:32.372066Z","shell.execute_reply":"2022-01-01T11:50:50.797059Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n max_iter=200)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.800394Z","iopub.execute_input":"2022-01-01T11:50:50.800795Z","iopub.status.idle":"2022-01-01T11:50:50.809566Z","shell.execute_reply.started":"2022-01-01T11:50:50.800767Z","shell.execute_reply":"2022-01-01T11:50:50.808911Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=0),\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.810633Z","iopub.execute_input":"2022-01-01T11:50:50.811078Z","iopub.status.idle":"2022-01-01T11:50:50.834426Z","shell.execute_reply.started":"2022-01-01T11:50:50.811040Z","shell.execute_reply":"2022-01-01T11:50:50.833776Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"(0.9161111111111111, (1800,), (1800, 11), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(regr_xgb, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.835608Z","iopub.execute_input":"2022-01-01T11:50:50.836142Z","iopub.status.idle":"2022-01-01T11:50:58.558180Z","shell.execute_reply.started":"2022-01-01T11:50:50.836100Z","shell.execute_reply":"2022-01-01T11:50:58.557365Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.559585Z","iopub.execute_input":"2022-01-01T11:50:58.560110Z","iopub.status.idle":"2022-01-01T11:50:58.569301Z","shell.execute_reply.started":"2022-01-01T11:50:58.560048Z","shell.execute_reply":"2022-01-01T11:50:58.568542Z"},"trusted":true},"execution_count":17,"outputs":[{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.570558Z","iopub.execute_input":"2022-01-01T11:50:58.570828Z","iopub.status.idle":"2022-01-01T11:50:58.584624Z","shell.execute_reply.started":"2022-01-01T11:50:58.570792Z","shell.execute_reply":"2022-01-01T11:50:58.584081Z"},"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"(0.7317444492376407, (1800,), (1800, 7))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(regr_xgb, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.585749Z","iopub.execute_input":"2022-01-01T11:50:58.586163Z","iopub.status.idle":"2022-01-01T11:51:09.404587Z","shell.execute_reply.started":"2022-01-01T11:50:58.586126Z","shell.execute_reply":"2022-01-01T11:51:09.403781Z"},"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.406057Z","iopub.execute_input":"2022-01-01T11:51:09.406434Z","iopub.status.idle":"2022-01-01T11:51:09.416068Z","shell.execute_reply.started":"2022-01-01T11:51:09.406399Z","shell.execute_reply":"2022-01-01T11:51:09.415411Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.417248Z","iopub.execute_input":"2022-01-01T11:51:09.417698Z","iopub.status.idle":"2022-01-01T11:51:09.450280Z","shell.execute_reply.started":"2022-01-01T11:51:09.417657Z","shell.execute_reply":"2022-01-01T11:51:09.449664Z"},"trusted":true},"execution_count":21,"outputs":[{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"(0.7274037362877257, (1800,), (1800, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_xgb, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.451169Z","iopub.execute_input":"2022-01-01T11:51:09.451507Z","iopub.status.idle":"2022-01-01T11:51:33.925757Z","shell.execute_reply.started":"2022-01-01T11:51:09.451482Z","shell.execute_reply":"2022-01-01T11:51:33.925076Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n importance_type='shap_importances', max_iter=200,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.926762Z","iopub.execute_input":"2022-01-01T11:51:33.926940Z","iopub.status.idle":"2022-01-01T11:51:33.934907Z","shell.execute_reply.started":"2022-01-01T11:51:33.926918Z","shell.execute_reply":"2022-01-01T11:51:33.934315Z"},"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=0),\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.935950Z","iopub.execute_input":"2022-01-01T11:51:33.936419Z","iopub.status.idle":"2022-01-01T11:51:33.961319Z","shell.execute_reply.started":"2022-01-01T11:51:33.936381Z","shell.execute_reply":"2022-01-01T11:51:33.960533Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"(0.91, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_xgb, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.962369Z","iopub.execute_input":"2022-01-01T11:51:33.962555Z","iopub.status.idle":"2022-01-01T11:51:47.059712Z","shell.execute_reply.started":"2022-01-01T11:51:33.962532Z","shell.execute_reply":"2022-01-01T11:51:47.058892Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.060847Z","iopub.execute_input":"2022-01-01T11:51:47.061090Z","iopub.status.idle":"2022-01-01T11:51:47.069229Z","shell.execute_reply.started":"2022-01-01T11:51:47.061061Z","shell.execute_reply":"2022-01-01T11:51:47.068462Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.070353Z","iopub.execute_input":"2022-01-01T11:51:47.071217Z","iopub.status.idle":"2022-01-01T11:51:47.087333Z","shell.execute_reply.started":"2022-01-01T11:51:47.071168Z","shell.execute_reply":"2022-01-01T11:51:47.086754Z"},"trusted":true},"execution_count":27,"outputs":[{"execution_count":27,"output_type":"execute_result","data":{"text/plain":"(0.7317444492376407, (1800,), (1800, 7))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_xgb, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.088455Z","iopub.execute_input":"2022-01-01T11:51:47.088921Z","iopub.status.idle":"2022-01-01T11:51:59.186202Z","shell.execute_reply.started":"2022-01-01T11:51:47.088885Z","shell.execute_reply":"2022-01-01T11:51:59.185431Z"},"trusted":true},"execution_count":28,"outputs":[{"execution_count":28,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.187276Z","iopub.execute_input":"2022-01-01T11:51:59.188081Z","iopub.status.idle":"2022-01-01T11:51:59.199276Z","shell.execute_reply.started":"2022-01-01T11:51:59.188004Z","shell.execute_reply":"2022-01-01T11:51:59.198325Z"},"trusted":true},"execution_count":29,"outputs":[{"execution_count":29,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.200366Z","iopub.execute_input":"2022-01-01T11:51:59.200640Z","iopub.status.idle":"2022-01-01T11:51:59.222774Z","shell.execute_reply.started":"2022-01-01T11:51:59.200592Z","shell.execute_reply":"2022-01-01T11:51:59.222078Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(0.7249664284333042, (1800,), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA ###\n\nmodel = BoostBoruta(clf_xgb, param_grid=param_grid, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.224176Z","iopub.execute_input":"2022-01-01T11:51:59.224707Z","iopub.status.idle":"2022-01-01T12:14:09.045290Z","shell.execute_reply.started":"2022-01-01T11:51:59.224667Z","shell.execute_reply":"2022-01-01T12:14:09.044649Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00026 ### eval_score: 0.20001\ntrial: 0002 ### iterations: 00022 ### eval_score: 0.20348\ntrial: 0003 ### iterations: 00026 ### eval_score: 0.20001\ntrial: 0004 ### iterations: 00022 ### eval_score: 0.20348\ntrial: 0005 ### iterations: 00048 ### eval_score: 0.19925\ntrial: 0006 ### iterations: 00052 ### eval_score: 0.20307\ntrial: 0007 ### iterations: 00048 ### eval_score: 0.19925\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.20307\n","output_type":"stream"},{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.046490Z","iopub.execute_input":"2022-01-01T12:14:09.047104Z","iopub.status.idle":"2022-01-01T12:14:09.056559Z","shell.execute_reply.started":"2022-01-01T12:14:09.047070Z","shell.execute_reply":"2022-01-01T12:14:09.056076Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1, max_delta_step=0,\n max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 10},\n 0.199248,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.057462Z","iopub.execute_input":"2022-01-01T12:14:09.057740Z","iopub.status.idle":"2022-01-01T12:14:09.086612Z","shell.execute_reply.started":"2022-01-01T12:14:09.057716Z","shell.execute_reply":"2022-01-01T12:14:09.085920Z"},"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"(0.9144444444444444, (1800,), (1800, 11), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(\n regr_xgb, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.087595Z","iopub.execute_input":"2022-01-01T12:14:09.087798Z","iopub.status.idle":"2022-01-01T12:16:42.203604Z","shell.execute_reply.started":"2022-01-01T12:14:09.087772Z","shell.execute_reply":"2022-01-01T12:16:42.202743Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.25941\ntrial: 0002 ### iterations: 00077 ### eval_score: 0.25055\ntrial: 0003 ### iterations: 00086 ### eval_score: 0.25676\ntrial: 0004 ### iterations: 00098 ### eval_score: 0.25383\ntrial: 0005 ### iterations: 00050 ### eval_score: 0.25751\ntrial: 0006 ### iterations: 00028 ### eval_score: 0.26007\ntrial: 0007 ### iterations: 00084 ### eval_score: 0.2603\ntrial: 0008 ### iterations: 00024 ### eval_score: 0.26278\n","output_type":"stream"},{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f5d29426490>,\n 'max_depth': [10, 12],\n 'num_leaves': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f5d29426710>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.205176Z","iopub.execute_input":"2022-01-01T12:16:42.205439Z","iopub.status.idle":"2022-01-01T12:16:42.215355Z","shell.execute_reply.started":"2022-01-01T12:16:42.205404Z","shell.execute_reply":"2022-01-01T12:16:42.214732Z"},"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1350674222191923,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=38, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.250552,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.216398Z","iopub.execute_input":"2022-01-01T12:16:42.216642Z","iopub.status.idle":"2022-01-01T12:16:42.242381Z","shell.execute_reply.started":"2022-01-01T12:16:42.216606Z","shell.execute_reply":"2022-01-01T12:16:42.241879Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"(0.7488873349293266, (1800,), (1800, 10))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(\n regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.245655Z","iopub.execute_input":"2022-01-01T12:16:42.247219Z","iopub.status.idle":"2022-01-01T12:26:08.685124Z","shell.execute_reply.started":"2022-01-01T12:16:42.247188Z","shell.execute_reply":"2022-01-01T12:26:08.684364Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.26412\ntrial: 0002 ### iterations: 00080 ### eval_score: 0.25357\ntrial: 0003 ### iterations: 00054 ### eval_score: 0.26123\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.2801\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.27046\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.24789\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.25928\ntrial: 0008 ### iterations: 00140 ### eval_score: 0.27284\n","output_type":"stream"},{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x7f5d29426890>,\n 'learning_rate': <hyperopt.pyll.base.Apply object at 0x7f5d29426a50>,\n 'max_depth': <hyperopt.pyll.base.Apply object at 0x7f5d29426790>},\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:26:08.686184Z","iopub.execute_input":"2022-01-01T12:26:08.686931Z","iopub.status.idle":"2022-01-01T12:26:08.696854Z","shell.execute_reply.started":"2022-01-01T12:26:08.686898Z","shell.execute_reply":"2022-01-01T12:26:08.696004Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.247887,\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:26:08.697934Z","iopub.execute_input":"2022-01-01T12:26:08.698155Z","iopub.status.idle":"2022-01-01T12:26:08.736781Z","shell.execute_reply.started":"2022-01-01T12:26:08.698128Z","shell.execute_reply":"2022-01-01T12:26:08.736145Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"(0.7542006308661441, (1800,), (1800, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_xgb, param_grid=param_grid, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"scrolled":true,"execution":{"iopub.status.busy":"2022-01-01T12:26:08.740222Z","iopub.execute_input":"2022-01-01T12:26:08.741848Z","iopub.status.idle":"2022-01-01T12:56:13.612807Z","shell.execute_reply.started":"2022-01-01T12:26:08.741813Z","shell.execute_reply":"2022-01-01T12:56:13.611991Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00024 ### eval_score: 0.20151\ntrial: 0002 ### iterations: 00020 ### eval_score: 0.20877\ntrial: 0003 ### iterations: 00024 ### eval_score: 0.20151\ntrial: 0004 ### iterations: 00020 ### eval_score: 0.20877\ntrial: 0005 ### iterations: 00048 ### eval_score: 0.20401\ntrial: 0006 ### iterations: 00048 ### eval_score: 0.20575\ntrial: 0007 ### iterations: 00048 ### eval_score: 0.20401\ntrial: 0008 ### iterations: 00048 ### eval_score: 0.20575\n","output_type":"stream"},{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n importance_type='shap_importances', max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]},\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.617168Z","iopub.execute_input":"2022-01-01T12:56:13.617372Z","iopub.status.idle":"2022-01-01T12:56:13.626563Z","shell.execute_reply.started":"2022-01-01T12:56:13.617349Z","shell.execute_reply":"2022-01-01T12:56:13.626036Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.2, max_delta_step=0,\n max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.2, 'num_leaves': 25, 'max_depth': 10},\n 0.201509,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.627454Z","iopub.execute_input":"2022-01-01T12:56:13.627825Z","iopub.status.idle":"2022-01-01T12:56:13.665907Z","shell.execute_reply.started":"2022-01-01T12:56:13.627797Z","shell.execute_reply":"2022-01-01T12:56:13.664686Z"},"trusted":true},"execution_count":42,"outputs":[{"execution_count":42,"output_type":"execute_result","data":{"text/plain":"(0.9144444444444444, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_xgb, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.667149Z","iopub.execute_input":"2022-01-01T12:56:13.667539Z","iopub.status.idle":"2022-01-01T13:08:38.854835Z","shell.execute_reply.started":"2022-01-01T12:56:13.667509Z","shell.execute_reply":"2022-01-01T13:08:38.854142Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.25941\ntrial: 0002 ### iterations: 00064 ### eval_score: 0.25075\ntrial: 0003 ### iterations: 00075 ### eval_score: 0.25493\ntrial: 0004 ### iterations: 00084 ### eval_score: 0.25002\ntrial: 0005 ### iterations: 00093 ### eval_score: 0.25609\ntrial: 0006 ### iterations: 00039 ### eval_score: 0.2573\ntrial: 0007 ### iterations: 00074 ### eval_score: 0.25348\ntrial: 0008 ### iterations: 00032 ### eval_score: 0.2583\n","output_type":"stream"},{"execution_count":43,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f5d29426490>,\n 'max_depth': [10, 12],\n 'num_leaves': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f5d29426710>},\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.855807Z","iopub.execute_input":"2022-01-01T13:08:38.856007Z","iopub.status.idle":"2022-01-01T13:08:38.866421Z","shell.execute_reply.started":"2022-01-01T13:08:38.855982Z","shell.execute_reply":"2022-01-01T13:08:38.865771Z"},"trusted":true},"execution_count":44,"outputs":[{"execution_count":44,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1669837381562427,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1669837381562427, 'num_leaves': 25, 'max_depth': 10},\n 0.250021,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.867249Z","iopub.execute_input":"2022-01-01T13:08:38.867888Z","iopub.status.idle":"2022-01-01T13:08:38.887178Z","shell.execute_reply.started":"2022-01-01T13:08:38.867860Z","shell.execute_reply":"2022-01-01T13:08:38.886666Z"},"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"(0.7499501426259738, (1800,), (1800, 11))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.890197Z","iopub.execute_input":"2022-01-01T13:08:38.891876Z","iopub.status.idle":"2022-01-01T13:41:32.886109Z","shell.execute_reply.started":"2022-01-01T13:08:38.891845Z","shell.execute_reply":"2022-01-01T13:41:32.885257Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.25811\ntrial: 0002 ### iterations: 00078 ### eval_score: 0.25554\ntrial: 0003 ### iterations: 00059 ### eval_score: 0.26658\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.27356\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.26426\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.25537\ntrial: 0007 ### iterations: 00052 ### eval_score: 0.26107\ntrial: 0008 ### iterations: 00137 ### eval_score: 0.27787\n","output_type":"stream"},{"execution_count":46,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x7f5d29426890>,\n 'learning_rate': <hyperopt.pyll.base.Apply object at 0x7f5d29426a50>,\n 'max_depth': <hyperopt.pyll.base.Apply object at 0x7f5d29426790>},\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.887300Z","iopub.execute_input":"2022-01-01T13:41:32.887495Z","iopub.status.idle":"2022-01-01T13:41:32.897203Z","shell.execute_reply.started":"2022-01-01T13:41:32.887472Z","shell.execute_reply":"2022-01-01T13:41:32.896455Z"},"trusted":true},"execution_count":47,"outputs":[{"execution_count":47,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.255374,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.898201Z","iopub.execute_input":"2022-01-01T13:41:32.898493Z","iopub.status.idle":"2022-01-01T13:41:32.931801Z","shell.execute_reply.started":"2022-01-01T13:41:32.898469Z","shell.execute_reply":"2022-01-01T13:41:32.931131Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"(0.7391290836488575, (1800,), (1800, 11))"},"metadata":{}}]},{"cell_type":"markdown","source":"# CUSTOM EVAL METRIC SUPPORT","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score\n\ndef AUC(y_hat, dtrain):\n y_true = dtrain.get_label()\n return 'auc', roc_auc_score(y_true, y_hat)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.932773Z","iopub.execute_input":"2022-01-01T13:41:32.932979Z","iopub.status.idle":"2022-01-01T13:41:32.940277Z","shell.execute_reply.started":"2022-01-01T13:41:32.932952Z","shell.execute_reply":"2022-01-01T13:41:32.939659Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"model = BoostRFE(\n clf_xgb, \n param_grid=param_grid, min_features_to_select=1, step=1,\n greater_is_better=True\n)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0,\n eval_metric=AUC\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.943194Z","iopub.execute_input":"2022-01-01T13:41:32.944797Z","iopub.status.idle":"2022-01-01T13:43:50.574377Z","shell.execute_reply.started":"2022-01-01T13:41:32.944765Z","shell.execute_reply":"2022-01-01T13:43:50.573628Z"},"trusted":true},"execution_count":50,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00017 ### eval_score: 0.9757\ntrial: 0002 ### iterations: 00026 ### eval_score: 0.97632\ntrial: 0003 ### iterations: 00017 ### eval_score: 0.9757\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.97632\ntrial: 0005 ### iterations: 00033 ### eval_score: 0.97594\ntrial: 0006 ### iterations: 00034 ### eval_score: 0.97577\ntrial: 0007 ### iterations: 00033 ### eval_score: 0.97594\ntrial: 0008 ### iterations: 00034 ### eval_score: 0.97577\n","output_type":"stream"},{"execution_count":50,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n greater_is_better=True, min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]}]}
================================================
FILE: requirements.txt
================================================
numpy
scipy
scikit-learn>=0.24.1
shap>=0.39.0
hyperopt==0.2.5
================================================
FILE: setup.py
================================================
import pathlib
from setuptools import setup, find_packages
HERE = pathlib.Path(__file__).parent
VERSION = '0.2.7'
PACKAGE_NAME = 'shap-hypetune'
AUTHOR = 'Marco Cerliani'
AUTHOR_EMAIL = 'cerlymarco@gmail.com'
URL = 'https://github.com/cerlymarco/shap-hypetune'
LICENSE = 'MIT'
DESCRIPTION = 'A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.'
LONG_DESCRIPTION = (HERE / "README.md").read_text()
LONG_DESC_TYPE = "text/markdown"
INSTALL_REQUIRES = [
'numpy',
'scipy',
'scikit-learn>=0.24.1',
'shap>=0.39.0',
'hyperopt==0.2.5'
]
setup(name=PACKAGE_NAME,
version=VERSION,
description=DESCRIPTION,
long_description=LONG_DESCRIPTION,
long_description_content_type=LONG_DESC_TYPE,
author=AUTHOR,
license=LICENSE,
author_email=AUTHOR_EMAIL,
url=URL,
install_requires=INSTALL_REQUIRES,
python_requires='>=3',
packages=find_packages()
)
================================================
FILE: shaphypetune/__init__.py
================================================
from .utils import *
from ._classes import *
from .shaphypetune import *
================================================
FILE: shaphypetune/_classes.py
================================================
import io
import contextlib
import warnings
import numpy as np
import scipy as sp
from copy import deepcopy
from sklearn.base import clone
from sklearn.utils.validation import check_is_fitted
from sklearn.base import BaseEstimator, TransformerMixin
from joblib import Parallel, delayed
from hyperopt import fmin, tpe
from .utils import ParameterSampler, _check_param, _check_boosting
from .utils import _set_categorical_indexes, _get_categorical_support
from .utils import _feature_importances, _shap_importances
class _BoostSearch(BaseEstimator):
"""Base class for BoostSearch meta-estimator.
Warning: This class should not be used directly. Use derived classes
instead.
"""
def __init__(self):
pass
def _validate_param_grid(self, fit_params):
"""Private method to validate fitting parameters."""
if not isinstance(self.param_grid, dict):
raise ValueError("Pass param_grid in dict format.")
self._param_grid = self.param_grid.copy()
for p_k, p_v in self._param_grid.items():
self._param_grid[p_k] = _check_param(p_v)
if 'eval_set' not in fit_params:
raise ValueError(
"When tuning parameters, at least "
"a evaluation set is required.")
self._eval_score = np.argmax if self.greater_is_better else np.argmin
self._score_sign = -1 if self.greater_is_better else 1
rs = ParameterSampler(
n_iter=self.n_iter,
param_distributions=self._param_grid,
random_state=self.sampling_seed
)
self._param_combi, self._tuning_type = rs.sample()
self._trial_id = 1
if self.verbose > 0:
n_trials = self.n_iter if self._tuning_type is 'hyperopt' \
else len(self._param_combi)
print("\n{} trials detected for {}\n".format(
n_trials, tuple(self.param_grid.keys())))
def _fit(self, X, y, fit_params, params=None):
"""Private method to fit a single boosting model and extract results."""
model = self._build_model(params)
if isinstance(model, _BoostSelector):
model.fit(X=X, y=y, **fit_params)
else:
with contextlib.redirect_stdout(io.StringIO()):
model.fit(X=X, y=y, **fit_params)
results = {'params': params, 'status': 'ok'}
if isinstance(model, _BoostSelector):
results['booster'] = model.estimator_
results['model'] = model
else:
results['booster'] = model
results['model'] = None
if 'eval_set' not in fit_params:
return results
if self.boost_type_ == 'XGB':
# w/ eval_set and w/ early_stopping_rounds
if hasattr(results['booster'], 'best_score'):
results['iterations'] = results['booster'].best_iteration
# w/ eval_set and w/o early_stopping_rounds
else:
valid_id = list(results['booster'].evals_result_.keys())[-1]
eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
results['iterations'] = \
len(results['booster'].evals_result_[valid_id][eval_metric])
else:
# w/ eval_set and w/ early_stopping_rounds
if results['booster'].best_iteration_ is not None:
results['iterations'] = results['booster'].best_iteration_
# w/ eval_set and w/o early_stopping_rounds
else:
valid_id = list(results['booster'].evals_result_.keys())[-1]
eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
results['iterations'] = \
len(results['booster'].evals_result_[valid_id][eval_metric])
if self.boost_type_ == 'XGB':
# w/ eval_set and w/ early_stopping_rounds
if hasattr(results['booster'], 'best_score'):
results['loss'] = results['booster'].best_score
# w/ eval_set and w/o early_stopping_rounds
else:
valid_id = list(results['booster'].evals_result_.keys())[-1]
eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
results['loss'] = \
results['booster'].evals_result_[valid_id][eval_metric][-1]
else:
valid_id = list(results['booster'].best_score_.keys())[-1]
eval_metric = list(results['booster'].best_score_[valid_id])[-1]
results['loss'] = results['booster'].best_score_[valid_id][eval_metric]
if params is not None:
if self.verbose > 0:
msg = "trial: {} ### iterations: {} ### eval_score: {}".format(
str(self._trial_id).zfill(4),
str(results['iterations']).zfill(5),
round(results['loss'], 5)
)
print(msg)
self._trial_id += 1
results['loss'] *= self._score_sign
return results
def fit(self, X, y, trials=None, **fit_params):
"""Fit the provided boosting algorithm while searching the best subset
of features (according to the selected strategy) and choosing the best
parameters configuration (if provided).
It takes the same arguments available in the estimator fit.
Parameters
----------
X : array-like of shape (n_samples, n_features)
The training input samples.
y : array-like of shape (n_samples,)
Target values.
trials : hyperopt.Trials() object, default=None
A hyperopt trials object, used to store intermediate results for all
optimization runs. Effective (and required) only when hyperopt
parameter searching is computed.
**fit_params : Additional fitting arguments.
Returns
-------
self : object
"""
self.boost_type_ = _check_boosting(self.estimator)
if self.param_grid is None:
results = self._fit(X, y, fit_params)
for v in vars(results['model']):
if v.endswith("_") and not v.startswith("__"):
setattr(self, str(v), getattr(results['model'], str(v)))
else:
self._validate_param_grid(fit_params)
if self._tuning_type == 'hyperopt':
if trials is None:
raise ValueError(
"trials must be not None when using hyperopt."
)
search = fmin(
fn=lambda p: self._fit(
params=p, X=X, y=y, fit_params=fit_params
),
space=self._param_combi, algo=tpe.suggest,
max_evals=self.n_iter, trials=trials,
rstate=np.random.RandomState(self.sampling_seed),
show_progressbar=False, verbose=0
)
all_results = trials.results
else:
all_results = Parallel(
n_jobs=self.n_jobs, verbose=self.verbose * int(bool(self.n_jobs))
)(delayed(self._fit)(X, y, fit_params, params)
for params in self._param_combi)
# extract results from parallel loops
self.trials_, self.iterations_, self.scores_, models = [], [], [], []
for job_res in all_results:
self.trials_.append(job_res['params'])
self.iterations_.append(job_res['iterations'])
self.scores_.append(self._score_sign * job_res['loss'])
if isinstance(job_res['model'], _BoostSelector):
models.append(job_res['model'])
else:
models.append(job_res['booster'])
# get the best
id_best = self._eval_score(self.scores_)
self.best_params_ = self.trials_[id_best]
self.best_iter_ = self.iterations_[id_best]
self.best_score_ = self.scores_[id_best]
self.estimator_ = models[id_best]
for v in vars(models[id_best]):
if v.endswith("_") and not v.startswith("__"):
setattr(self, str(v), getattr(models[id_best], str(v)))
return self
def predict(self, X, **predict_params):
"""Predict X.
Parameters
----------
X : array-like of shape (n_samples, n_features)
Samples.
**predict_params : Additional predict arguments.
Returns
-------
pred : ndarray of shape (n_samples,)
The predicted values.
"""
check_is_fitted(self)
if hasattr(self, 'transform'):
X = self.transform(X)
return self.estimator_.predict(X, **predict_params)
def predict_proba(self, X, **predict_params):
"""Predict X probabilities.
Parameters
----------
X : array-like of shape (n_samples, n_features)
Samples.
**predict_params : Additional predict arguments.
Returns
-------
pred : ndarray of shape (n_samples, n_classes)
The predicted values.
"""
check_is_fitted(self)
# raise original AttributeError
getattr(self.estimator_, 'predict_proba')
if hasattr(self, 'transform'):
X = self.transform(X)
return self.estimator_.predict_proba(X, **predict_params)
def score(self, X, y, sample_weight=None):
"""Return the score on the given test data and labels.
Parameters
----------
X : array-like of shape (n_samples, n_features)
Test samples.
y : array-like of shape (n_samples,)
True values for X.
sample_weight : array-like of shape (n_samples,), default=None
Sample weights.
Returns
-------
score : float
Accuracy for classification, R2 for regression.
"""
check_is_fitted(self)
if hasattr(self, 'transform'):
X = self.transform(X)
return self.estimator_.score(X, y, sample_weight=sample_weight)
class _BoostSelector(BaseEstimator, TransformerMixin):
"""Base class for feature selection meta-estimator.
Warning: This class should not be used directly. Use derived classes
instead.
"""
def __init__(self):
pass
def transform(self, X):
"""Reduces the input X to the features selected by Boruta.
Parameters
----------
X : array-like of shape (n_samples, n_features)
Samples.
Returns
-------
X : array-like of shape (n_samples, n_features_)
The input samples with only the selected features by Boruta.
"""
check_is_fitted(self)
shapes = np.shape(X)
if len(shapes) != 2:
raise ValueError("X must be 2D.")
if shapes[1] != self.support_.shape[0]:
raise ValueError(
"Expected {} features, received {}.".format(
self.support_.shape[0], shapes[1]))
if isinstance(X, np.ndarray):
return X[:, self.support_]
elif hasattr(X, 'loc'):
return X.loc[:, self.support_]
else:
raise ValueError("Data type not understood.")
def get_support(self, indices=False):
"""Get a mask, or integer index, of the features selected.
Parameters
----------
indices : bool, default=False
If True, the return value will be an array of integers, rather
than a boolean mask.
Returns
-------
support : array
An index that selects the retained features from a feature vector.
If `indices` is False, this is a boolean array of shape
[# input features], in which an element is True iff its
corresponding feature is selected for retention. If `indices` is
True, this is an integer array of shape [# output features] whose
values are indices into the input feature vector.
"""
check_is_fitted(self)
mask = self.support_
return mask if not indices else np.where(mask)[0]
class _Boruta(_BoostSelector):
"""Base class for BoostBoruta meta-estimator.
Warning: This class should not be used directly. Use derived classes
instead.
Notes
-----
The code for the Boruta algorithm is inspired and improved from:
https://github.com/scikit-learn-contrib/boruta_py
"""
def __init__(self,
estimator, *,
perc=100,
alpha=0.05,
max_iter=100,
early_stopping_boruta_rounds=None,
importance_type='feature_importances',
train_importance=True,
verbose=0):
self.estimator = estimator
self.perc = perc
self.alpha = alpha
self.max_iter = max_iter
self.early_stopping_boruta_rounds = early_stopping_boruta_rounds
self.importance_type = importance_type
self.train_importance = train_importance
self.verbose = verbose
def _create_X(self, X, feat_id_real):
"""Private method to add shadow features to the original ones. """
if isinstance(X, np.ndarray):
X_real = X[:, feat_id_real].copy()
X_sha = X_real.copy()
X_sha = np.apply_along_axis(self._random_state.permutation, 0, X_sha)
X = np.hstack((X_real, X_sha))
elif hasattr(X, 'iloc'):
X_real = X.iloc[:, feat_id_real].copy()
X_sha = X_real.copy()
X_sha = X_sha.apply(self._random_state.permutation)
X_sha = X_sha.astype(X_real.dtypes)
X = X_real.join(X_sha, rsuffix='_SHA')
else:
raise ValueError("Data type not understood.")
return X
def _check_fit_params(self, fit_params, feat_id_real=None):
"""Private method to validate and check fit_params."""
_fit_params = deepcopy(fit_params)
estimator = clone(self.estimator)
# add here possible estimator checks in each iteration
_fit_params = _set_categorical_indexes(
self.support_, self._cat_support, _fit_params, duplicate=True)
if feat_id_real is None: # final model fit
if 'eval_set' in _fit_params:
_fit_params['eval_set'] = list(map(lambda x: (
self.transform(x[0]), x[1]
), _fit_params['eval_set']))
else:
if 'eval_set' in _fit_params: # iterative model fit
_fit_params['eval_set'] = list(map(lambda x: (
self._create_X(x[0], feat_id_real), x[1]
), _fit_params['eval_set']))
if 'feature_name' in _fit_params: # LGB
_fit_params['feature_name'] = 'auto'
if 'feature_weights' in _fit_params: # XGB import warnings
warnings.warn(
"feature_weights is not supported when selecting features. "
"It's automatically set to None.")
_fit_params['feature_weights'] = None
return _fit_params, estimator
def _do_tests(self, dec_reg, hit_reg, iter_id):
"""Private method to operate Bonferroni corrections on the feature
selections."""
active_features = np.where(dec_reg >= 0)[0]
hits = hit_reg[active_features]
# get uncorrected p values based on hit_reg
to_accept_ps = sp.stats.binom.sf(hits - 1, iter_id, .5).flatten()
to_reject_ps = sp.stats.binom.cdf(hits, iter_id, .5).flatten()
# Bonferroni correction with the total n_features in each iteration
to_accept = to_accept_ps <= self.alpha / float(len(dec_reg))
to_reject = to_reject_ps <= self.alpha / float(len(dec_reg))
# find features which are 0 and have been rejected or accepted
to_accept = np.where((dec_reg[active_features] == 0) * to_accept)[0]
to_reject = np.where((dec_reg[active_features] == 0) * to_reject)[0]
# updating dec_reg
dec_reg[active_features[to_accept]] = 1
dec_reg[active_features[to_reject]] = -1
return dec_reg
def fit(self, X, y, **fit_params):
"""Fit the Boruta algorithm to automatically tune
the number of selected features."""
self.boost_type_ = _check_boosting(self.estimator)
if self.max_iter < 1:
raise ValueError('max_iter should be an integer >0.')
if self.perc <= 0 or self.perc > 100:
raise ValueError('The percentile should be between 0 and 100.')
if self.alpha <= 0 or self.alpha > 1:
raise ValueError('alpha should be between 0 and 1.')
if self.early_stopping_boruta_rounds is None:
es_boruta_rounds = self.max_iter
else:
if self.early_stopping_boruta_rounds < 1:
raise ValueError(
'early_stopping_boruta_rounds should be an integer >0.')
es_boruta_rounds = self.early_stopping_boruta_rounds
importances = ['feature_importances', 'shap_importances']
if self.importance_type not in importances:
raise ValueError(
"importance_type must be one of {}. Get '{}'".format(
importances, self.importance_type))
if self.importance_type == 'shap_importances':
if not self.train_importance and not 'eval_set' in fit_params:
raise ValueError(
"When train_importance is set to False, using "
"shap_importances, pass at least a eval_set.")
eval_importance = not self.train_importance and 'eval_set' in fit_params
shapes = np.shape(X)
if len(shapes) != 2:
raise ValueError("X must be 2D.")
n_features = shapes[1]
# create mask for user-defined categorical features
self._cat_support = _get_categorical_support(n_features, fit_params)
# holds the decision about each feature:
# default (0); accepted (1); rejected (-1)
dec_reg = np.zeros(n_features, dtype=int)
dec_history = np.zeros((self.max_iter, n_features), dtype=int)
# counts how many times a given feature was more important than
# the best of the shadow features
hit_reg = np.zeros(n_features, dtype=int)
# record the history of the iterations
imp_history = np.zeros(n_features, dtype=float)
sha_max_history = []
for i in range(self.max_iter):
if (dec_reg != 0).all():
if self.verbose > 1:
print("All Features analyzed. Boruta stop!")
break
if self.verbose > 1:
print('Iteration: {} / {}'.format(i + 1, self.max_iter))
self._random_state = np.random.RandomState(i + 1000)
# add shadow attributes, shuffle and train estimator
self.support_ = dec_reg >= 0
feat_id_real = np.where(self.support_)[0]
n_real = feat_id_real.shape[0]
_fit_params, estimator = self._check_fit_params(fit_params, feat_id_real)
estimator.set_params(random_state=i + 1000)
_X = self._create_X(X, feat_id_real)
with contextlib.redirect_stdout(io.StringIO()):
estimator.fit(_X, y, **_fit_params)
# get coefs
if self.importance_type == 'feature_importances':
coefs = _feature_importances(estimator)
else:
if eval_importance:
coefs = _shap_importances(
estimator, _fit_params['eval_set'][-1][0])
else:
coefs = _shap_importances(estimator, _X)
# separate importances of real and shadow features
imp_sha = coefs[n_real:]
imp_real = np.zeros(n_features) * np.nan
imp_real[feat_id_real] = coefs[:n_real]
# get the threshold of shadow importances used for rejection
imp_sha_max = np.percentile(imp_sha, self.perc)
# record importance history
sha_max_history.append(imp_sha_max)
imp_history = np.vstack((imp_history, imp_real))
# register which feature is more imp than the max of shadows
hit_reg[np.where(imp_real[~np.isnan(imp_real)] > imp_sha_max)[0]] += 1
# check if a feature is doing better than expected by chance
dec_reg = self._do_tests(dec_reg, hit_reg, i + 1)
dec_history[i] = dec_reg
es_id = i - es_boruta_rounds
if es_id >= 0:
if np.equal(dec_history[es_id:(i + 1)], dec_reg).all():
if self.verbose > 0:
print("Boruta early stopping at iteration {}".format(i + 1))
break
confirmed = np.where(dec_reg == 1)[0]
tentative = np.where(dec_reg == 0)[0]
self.support_ = np.zeros(n_features, dtype=bool)
self.ranking_ = np.ones(n_features, dtype=int) * 4
self.n_features_ = confirmed.shape[0]
self.importance_history_ = imp_history[1:]
if tentative.shape[0] > 0:
tentative_median = np.nanmedian(imp_history[1:, tentative], axis=0)
tentative_low = tentative[
np.where(tentative_median <= np.median(sha_max_history))[0]]
tentative_up = np.setdiff1d(tentative, tentative_low)
self.ranking_[tentative_low] = 3
if tentative_up.shape[0] > 0:
self.ranking_[tentative_up] = 2
if confirmed.shape[0] > 0:
self.support_[confirmed] = True
self.ranking_[confirmed] = 1
if (~self.support_).all():
raise RuntimeError(
"Boruta didn't select any feature. Try to increase max_iter or "
"increase (if not None) early_stopping_boruta_rounds or "
"decrese perc.")
_fit_params, self.estimator_ = self._check_fit_params(fit_params)
with contextlib.redirect_stdout(io.StringIO()):
self.estimator_.fit(self.transform(X), y, **_fit_params)
return self
class _RFE(_BoostSelector):
"""Base class for BoostRFE meta-estimator.
Warning: This class should not be used directly. Use derived classes
instead.
"""
def __init__(self,
estimator, *,
min_features_to_select=None,
step=1,
greater_is_better=False,
importance_type='feature_importances',
train_importance=True,
verbose=0):
self.estimator = estimator
self.min_features_to_select = min_features_to_select
self.step = step
self.greater_is_better = greater_is_better
self.importance_type = importance_type
self.train_importance = train_importance
self.verbose = verbose
def _check_fit_params(self, fit_params):
"""Private method to validate and check fit_params."""
_fit_params = deepcopy(fit_params)
estimator = clone(self.estimator)
# add here possible estimator checks in each iteration
_fit_params = _set_categorical_indexes(
self.support_, self._cat_support, _fit_params)
if 'eval_set' in _fit_params:
_fit_params['eval_set'] = list(map(lambda x: (
self.transform(x[0]), x[1]
), _fit_params['eval_set']))
if 'feature_name' in _fit_params: # LGB
_fit_params['feature_name'] = 'auto'
if 'feature_weights' in _fit_params: # XGB import warnings
warnings.warn(
"feature_weights is not supported when selecting features. "
"It's automatically set to None.")
_fit_params['feature_weights'] = None
return _fit_params, estimator
def _step_score(self, estimator):
"""Return the score for a fit on eval_set."""
if self.boost_type_ == 'LGB':
valid_id = list(estimator.best_score_.keys())[-1]
eval_metric = list(estimator.best_score_[valid_id])[-1]
score = estimator.best_score_[valid_id][eval_metric]
else:
# w/ eval_set and w/ early_stopping_rounds
if hasattr(estimator, 'best_score'):
score = estimator.best_score
# w/ eval_set and w/o early_stopping_rounds
else:
valid_id = list(estimator.evals_result_.keys())[-1]
eval_metric = list(estimator.evals_result_[valid_id])[-1]
score = estimator.evals_result_[valid_id][eval_metric][-1]
return score
def fit(self, X, y, **fit_params):
"""Fit the RFE algorithm to automatically tune
the number of selected features."""
self.boost_type_ = _check_boosting(self.estimator)
importances = ['feature_importances', 'shap_importances']
if self.importance_type not in importances:
raise ValueError(
"importance_type must be one of {}. Get '{}'".format(
importances, self.importance_type))
# scoring controls the calculation of self.score_history_
# scoring is used automatically when 'eval_set' is in fit_params
scoring = 'eval_set' in fit_params
if self.importance_type == 'shap_importances':
if not self.train_importance and not scoring:
raise ValueError(
"When train_importance is set to False, using "
"shap_importances, pass at least a eval_set.")
eval_importance = not self.train_importance and scoring
shapes = np.shape(X)
if len(shapes) != 2:
raise ValueError("X must be 2D.")
n_features = shapes[1]
# create mask for user-defined categorical features
self._cat_support = _get_categorical_support(n_features, fit_params)
if self.min_features_to_select is None:
if scoring:
min_features_to_select = 1
else:
min_features_to_select = n_features // 2
else:
min_features_to_select = self.min_features_to_select
if 0.0 < self.step < 1.0:
step = int(max(1, self.step * n_features))
else:
step = int(self.step)
if step <= 0:
raise ValueError("Step must be >0.")
self.support_ = np.ones(n_features, dtype=bool)
self.ranking_ = np.ones(n_features, dtype=int)
if scoring:
self.score_history_ = []
eval_score = np.max if self.greater_is_better else np.min
best_score = -np.inf if self.greater_is_better else np.inf
while np.sum(self.support_) > min_features_to_select:
# remaining features
features = np.arange(n_features)[self.support_]
_fit_params, estimator = self._check_fit_params(fit_params)
if self.verbose > 1:
print("Fitting estimator with {} features".format(
self.support_.sum()))
with contextlib.redirect_stdout(io.StringIO()):
estimator.fit(self.transform(X), y, **_fit_params)
# get coefs
if self.importance_type == 'feature_importances':
coefs = _feature_importances(estimator)
else:
if eval_importance:
coefs = _shap_importances(
estimator, _fit_params['eval_set'][-1][0])
else:
coefs = _shap_importances(
estimator, self.transform(X))
ranks = np.argsort(coefs)
# eliminate the worse features
threshold = min(step, np.sum(self.support_) - min_features_to_select)
# compute step score on the previous selection iteration
# because 'estimator' must use features
# that have not been eliminated yet
if scoring:
score = self._step_score(estimator)
self.score_history_.append(score)
if best_score != eval_score([score, best_score]):
best_score = score
best_support = self.support_.copy()
best_ranking = self.ranking_.copy()
best_estimator = estimator
self.support_[features[ranks][:threshold]] = False
self.ranking_[np.logical_not(self.support_)] += 1
# set final attributes
_fit_params, self.estimator_ = self._check_fit_params(fit_params)
if self.verbose > 1:
print("Fitting estimator with {} features".format(self.support_.sum()))
with contextlib.redirect_stdout(io.StringIO()):
self.estimator_.fit(self.transform(X), y, **_fit_params)
# compute step score when only min_features_to_select features left
if scoring:
score = self._step_score(self.estimator_)
self.score_history_.append(score)
if best_score == eval_score([score, best_score]):
self.support_ = best_support
self.ranking_ = best_ranking
self.estimator_ = best_estimator
self.n_features_ = self.support_.sum()
return self
class _RFA(_BoostSelector):
"""Base class for BoostRFA meta-estimator.
Warning: This class should not be used directly. Use derived classes
instead.
"""
def __init__(self,
estimator, *,
min_features_to_select=None,
step=1,
greater_is_better=False,
importance_type='feature_importances',
train_importance=True,
verbose=0):
self.estimator = estimator
self.min_features_to_select = min_features_to_select
self.step = step
self.greater_is_better = greater_is_better
self.importance_type = importance_type
self.train_importance = train_importance
self.verbose = verbose
def _check_fit_params(self, fit_params, inverse=False):
"""Private method to validate and check fit_params."""
_fit_params = deepcopy(fit_params)
estimator = clone(self.estimator)
# add here possible estimator checks in each iteration
_fit_params = _set_categorical_indexes(
self.support_, self._cat_support, _fit_params)
if 'eval_set' in _fit_params:
_fit_params['eval_set'] = list(map(lambda x: (
self._transform(x[0], inverse), x[1]
), _fit_params['eval_set']))
if 'feature_name' in _fit_params: # LGB
_fit_params['feature_name'] = 'auto'
if 'feature_weights' in _fit_params: # XGB import warnings
warnings.warn(
"feature_weights is not supported when selecting features. "
"It's automatically set to None.")
_fit_params['feature_weights'] = None
return _fit_params, estimator
def _step_score(self, estimator):
"""Return the score for a fit on eval_set."""
if self.boost_type_ == 'LGB':
valid_id = list(estimator.best_score_.keys())[-1]
eval_metric = list(estimator.best_score_[valid_id])[-1]
score = estimator.best_score_[valid_id][eval_metric]
else:
# w/ eval_set and w/ early_stopping_rounds
if hasattr(estimator, 'best_score'):
score = estimator.best_score
# w/ eval_set and w/o early_stopping_rounds
else:
valid_id = list(estimator.evals_result_.keys())[-1]
eval_metric = list(estimator.evals_result_[valid_id])[-1]
score = estimator.evals_result_[valid_id][eval_metric][-1]
return score
def fit(self, X, y, **fit_params):
"""Fit the RFA algorithm to automatically tune
the number of selected features."""
self.boost_type_ = _check_boosting(self.estimator)
importances = ['feature_importances', 'shap_importances']
if self.importance_type not in importances:
raise ValueError(
"importance_type must be one of {}. Get '{}'".format(
importances, self.importance_type))
# scoring controls the calculation of self.score_history_
# scoring is used automatically when 'eval_set' is in fit_params
scoring = 'eval_set' in fit_params
if self.importance_type == 'shap_importances':
if not self.train_importance and not scoring:
raise ValueError(
"When train_importance is set to False, using "
"shap_importances, pass at least a eval_set.")
eval_importance = not self.train_importance and scoring
shapes = np.shape(X)
if len(shapes) != 2:
raise ValueError("X must be 2D.")
n_features = shapes[1]
# create mask for user-defined categorical features
self._cat_support = _get_categorical_support(n_features, fit_params)
if self.min_features_to_select is None:
if scoring:
min_features_to_select = 1
else:
min_features_to_select = n_features // 2
else:
if scoring:
min_features_to_select = self.min_features_to_select
else:
min_features_to_select = n_features - self.min_features_to_select
if 0.0 < self.step < 1.0:
step = int(max(1, self.step * n_features))
else:
step = int(self.step)
if step <= 0:
raise ValueError("Step must be >0.")
self.support_ = np.zeros(n_features, dtype=bool)
self._support = np.ones(n_features, dtype=bool)
self.ranking_ = np.ones(n_features, dtype=int)
self._ranking = np.ones(n_features, dtype=int)
if scoring:
self.score_history_ = []
eval_score = np.max if self.greater_is_better else np.min
best_score = -np.inf if self.greater_is_better else np.inf
while np.sum(self._support) > min_features_to_select:
# remaining features
features = np.arange(n_features)[self._support]
# scoring the previous added features
if scoring and np.sum(self.support_) > 0:
_fit_params, estimator = self._check_fit_params(fit_params)
with contextlib.redirect_stdout(io.StringIO()):
estimator.fit(self._transform(X, inverse=False), y, **_fit_params)
score = self._step_score(estimator)
self.score_history_.append(score)
if best_score != eval_score([score, best_score]):
best_score = score
best_support = self.support_.copy()
best_ranking = self.ranking_.copy()
best_estimator = estimator
# evaluate the remaining features
_fit_params, _estimator = self._check_fit_params(fit_params, inverse=True)
if self.verbose > 1:
print("Fitting estimator with {} features".format(self._support.sum()))
with contextlib.redirect_stdout(io.StringIO()):
_estimator.fit(self._transform(X, inverse=True), y, **_fit_params)
if self._support.sum() == n_features:
all_features_estimator = _estimator
# get coefs
if self.importance_type == 'feature_importances':
coefs = _feature_importances(_estimator)
else:
if eval_importance:
coefs = _shap_importances(
_estimator, _fit_params['eval_set'][-1][0])
else:
coefs = _shap_importances(
_estimator, self._transform(X, inverse=True))
ranks = np.argsort(-coefs) # the rank is inverted
# add the best features
threshold = min(step, np.sum(self._support) - min_features_to_select)
# remaining features to test
self._support[features[ranks][:threshold]] = False
self._ranking[np.logical_not(self._support)] += 1
# features tested
self.support_[features[ranks][:threshold]] = True
self.ranking_[np.logical_not(self.support_)] += 1
# set final attributes
_fit_params, self.estimator_ = self._check_fit_params(fit_params)
if self.verbose > 1:
print("Fitting estimator with {} features".format(self._support.sum()))
with contextlib.redirect_stdout(io.StringIO()):
self.estimator_.fit(self._transform(X, inverse=False), y, **_fit_params)
# compute step score when only min_features_to_select features left
if scoring:
score = self._step_score(self.estimator_)
self.score_history_.append(score)
if best_score == eval_score([score, best_score]):
self.support_ = best_support
self.ranking_ = best_ranking
self.estimator_ = best_estimator
if len(set(self.score_history_)) == 1:
self.support_ = np.ones(n_features, dtype=bool)
self.ranking_ = np.ones(n_features, dtype=int)
self.estimator_ = all_features_estimator
self.n_features_ = self.support_.sum()
return self
def _transform(self, X, inverse=False):
"""Private method to reduce the input X to the features selected."""
shapes = np.shape(X)
if len(shapes) != 2:
raise ValueError("X must be 2D.")
if shapes[1] != self.support_.shape[0]:
raise ValueError(
"Expected {} features, received {}.".format(
self.support_.shape[0], shapes[1]))
if inverse:
if isinstance(X, np.ndarray):
return X[:, self._support]
elif hasattr(X, 'loc'):
return X.loc[:, self._support]
elif sp.sparse.issparse(X):
return X[:, self._support]
else:
raise ValueError("Data type not understood.")
else:
if isinstance(X, np.ndarray):
return X[:, self.support_]
elif hasattr(X, 'loc'):
return X.loc[:, self.support_]
elif sp.sparse.issparse(X):
return X[:, self.support_]
else:
raise ValueError("Data type not understood.")
def transform(self, X):
"""Reduces the input X to the features selected with RFA.
Parameters
----------
X : array-like of shape (n_samples, n_features)
Samples.
Returns
-------
X : array-like of shape (n_samples, n_features_)
The input samples with only the selected features by Boruta.
"""
check_is_fitted(self)
return self._transform(X, inverse=False)
================================================
FILE: shaphypetune/shaphypetune.py
================================================
from sklearn.base import clone
from ._classes import _BoostSearch, _Boruta, _RFA, _RFE
class BoostSearch(_BoostSearch):
"""Hyperparamater searching and optimization on a given validation set
for LGBModel or XGBModel.
Pass a LGBModel or XGBModel, and a dictionary with the parameter boundaries
for grid, random or bayesian search.
To operate random search pass distributions in the param_grid with rvs
method for sampling (such as those from scipy.stats.distributions).
To operate bayesian search pass hyperopt distributions.
The specification of n_iter or sampling_seed is effective only with random
or hyperopt searches.
The best parameter combination is the one which obtain the better score
(as returned by eval_metric) on the provided eval_set.
If all parameters are presented as a list/floats/integers, grid-search
is performed. If at least one parameter is given as a distribution (such as
those from scipy.stats.distributions), random-search is performed computing
sampling with replacement. Bayesian search is effective only when all the
parameters to tune are in form of hyperopt distributions.
It is highly recommended to use continuous distributions for continuous
parameters.
Parameters
----------
estimator : object
A supervised learning estimator of LGBModel or XGBModel type.
param_grid : dict
Dictionary with parameters names (`str`) as keys and distributions
or lists of parameters to try.
greater_is_better : bool, default=False
Whether the quantity to monitor is a score function,
meaning high is good, or a loss function, meaning low is good.
n_iter : int, default=None
Effective only for random or hyperopt search.
Number of parameter settings that are sampled.
n_iter trades off runtime vs quality of the solution.
sampling_seed : int, default=None
Effective only for random or hyperopt search.
The seed used to sample from the hyperparameter distributions.
n_jobs : int, default=None
Effective only with grid and random search.
The number of jobs to run in parallel for model fitting.
``None`` means 1 using one processor. ``-1`` means using all
processors.
verbose : int, default=1
Verbosity mode. <=0 silent all; >0 print trial logs with the
connected score.
Attributes
----------
estimator_ : estimator
Estimator that was chosen by the search, i.e. estimator
which gave the best score on the eval_set.
best_params_ : dict
Parameter setting that gave the best results on the eval_set.
trials_ : list
A list of dicts. The dicts are all the parameter combinations tried
and derived from the param_grid.
best_score_ : float
The best score achieved by all the possible combination created.
scores_ : list
The scores achieved on the eval_set by all the models tried.
best_iter_ : int
The boosting iterations achieved by the best parameters combination.
iterations_ : list
The boosting iterations of all the models tried.
boost_type_ : str
The type of the boosting estimator (LGB or XGB).
"""
def __init__(self,
estimator, *,
param_grid,
greater_is_better=False,
n_iter=None,
sampling_seed=None,
verbose=1,
n_jobs=None):
self.estimator = estimator
self.param_grid = param_grid
self.greater_is_better = greater_is_better
self.n_iter = n_iter
self.sampling_seed = sampling_seed
self.verbose = verbose
self.n_jobs = n_jobs
def _build_model(self, params):
"""Private method to build model."""
model = clone(self.estimator)
model.set_params(**params)
return model
class BoostBoruta(_BoostSearch, _Boruta):
"""Simultaneous features selection with Boruta algorithm and hyperparamater
searching on a given validation set for LGBModel or XGBModel.
Pass a LGBModel or XGBModel to compute features selection with Boruta
algorithm. The best features are used to train a new gradient boosting
instance. When a eval_set is provided, shadow features are build also on it.
If param_grid is a dictionary with parameter boundaries, a hyperparameter
tuning is computed simultaneously. The parameter combinations are scored on
the provided eval_set.
To operate random search pass distributions in the param_grid with rvs
method for sampling (such as those from scipy.stats.distributions).
To operate bayesian search pass hyperopt distributions.
The specification of n_iter or sampling_seed is effective only with random
or hyperopt searches.
The best parameter combination is the one which obtain the better score
(as returned by eval_metric) on the provided eval_set.
If all parameters are presented as a list/floats/integers, grid-search
is performed. If at least one parameter is given as a distribution (such as
those from scipy.stats.distributions), random-search is performed computing
sampling with replacement. Bayesian search is effective only when all the
parameters to tune are in form of hyperopt distributions.
It is highly recommended to use continuous distributions for continuous
parameters.
Parameters
----------
estimator : object
A supervised learning estimator of LGBModel or XGBModel type.
perc : int, default=100
Threshold for comparison between shadow and real features.
The lower perc is the more false positives will be picked as relevant
but also the less relevant features will be left out.
100 correspond to the max.
alpha : float, default=0.05
Level at which the corrected p-values will get rejected in the
correction steps.
max_iter : int, default=100
The number of maximum Boruta iterations to perform.
early_stopping_boruta_rounds : int, default=None
The maximum amount of iterations without confirming a tentative
feature. Use early stopping to terminate the selection process
before reaching `max_iter` iterations if the algorithm cannot
confirm a tentative feature after N iterations.
None means no early stopping search.
importance_type : str, default='feature_importances'
Which importance measure to use. It can be 'feature_importances'
(the default feature importance of the gradient boosting estimator)
or 'shap_importances'.
train_importance : bool, default=True
Effective only when importance_type='shap_importances'.
Where to compute the shap feature importance: on train (True)
or on eval_set (False).
param_grid : dict, default=None
Dictionary with parameters names (`str`) as keys and distributions
or lists of parameters to try.
None means no hyperparameters search.
greater_is_better : bool, default=False
Effective only when hyperparameters searching.
Whether the quantity to monitor is a score function,
meaning high is good, or a loss function, meaning low is good.
n_iter : int, default=None
Effective only when hyperparameters searching.
Effective only for random or hyperopt seraches.
Number of parameter settings that are sampled.
n_iter trades off runtime vs quality of the solution.
sampling_seed : int, default=None
Effective only when hyperparameters searching.
Effective only for random or hyperopt serach.
The seed used to sample from the hyperparameter distributions.
n_jobs : int, default=None
Effective only when hyperparameters searching without hyperopt.
The number of jobs to run in parallel for model fitting.
``None`` means 1 using one processor. ``-1`` means using all
processors.
verbose : int, default=1
Verbosity mode. <=0 silent all; ==1 print trial logs (when
hyperparameters searching); >1 print feature selection logs plus
trial logs (when hyperparameters searching).
Attributes
----------
estimator_ : estimator
The fitted estimator with the select features and the optimal
parameter combination (when hyperparameters searching).
n_features_ : int
The number of selected features (from the best param config
when hyperparameters searching).
ranking_ : ndarray of shape (n_features,)
The feature ranking, such that ``ranking_[i]`` corresponds to the
ranking position of the i-th feature (from the best param config
when hyperparameters searching). Selected features are assigned
rank 1 (2: tentative upper bound, 3: tentative lower bound, 4:
rejected).
support_ : ndarray of shape (n_features,)
The mask of selected features (from the best param config
when hyperparameters searching).
importance_history_ : ndarray of shape (n_features, n_iters)
The importance values for each feature across all iterations.
best_params_ : dict
Available only when hyperparameters searching.
Parameter setting that gave the best results on the eval_set.
trials_ : list
Available only when hyperparameters searching.
A list of dicts. The dicts are all the parameter combinations tried
and derived from the param_grid.
best_score_ : float
Available only when hyperparameters searching.
The best score achieved by all the possible combination created.
scores_ : list
Available only when hyperparameters searching.
The scores achived on the eval_set by all the models tried.
best_iter_ : int
Available only when hyperparameters searching.
The boosting iterations achieved by the best parameters combination.
iterations_ : list
Available only when hyperparameters searching.
The boosting iterations of all the models tried.
boost_type_ : str
The type of the boosting estimator (LGB or XGB).
Notes
-----
The code for the Boruta algorithm is inspired and improved from:
https://github.com/scikit-learn-contrib/boruta_py
"""
def __init__(self,
estimator, *,
perc=100,
alpha=0.05,
max_iter=100,
early_stopping_boruta_rounds=None,
param_grid=None,
greater_is_better=False,
importance_type='feature_importances',
train_importance=True,
n_iter=None,
sampling_seed=None,
verbose=1,
n_jobs=None):
self.estimator = estimator
self.perc = perc
self.alpha = alpha
self.max_iter = max_iter
self.early_stopping_boruta_rounds = early_stopping_boruta_rounds
self.param_grid = param_grid
self.greater_is_better = greater_is_better
self.importance_type = importance_type
self.train_importance = train_importance
self.n_iter = n_iter
self.sampling_seed = sampling_seed
self.verbose = verbose
self.n_jobs = n_jobs
def _build_model(self, params=None):
"""Private method to build model."""
estimator = clone(self.estimator)
if params is None:
model = _Boruta(
estimator=estimator,
perc=self.perc,
alpha=self.alpha,
max_iter=self.max_iter,
early_stopping_boruta_rounds=self.early_stopping_boruta_rounds,
importance_type=self.importance_type,
train_importance=self.train_importance,
verbose=self.verbose
)
else:
estimator.set_params(**params)
model = _Boruta(
estimator=estimator,
perc=self.perc,
alpha=self.alpha,
max_iter=self.max_iter,
early_stopping_boruta_rounds=self.early_stopping_boruta_rounds,
importance_type=self.importance_type,
train_importance=self.train_importance,
verbose=self.verbose
)
return model
class BoostRFE(_BoostSearch, _RFE):
"""Simultaneous features selection with RFE and hyperparamater searching
on a given validation set for LGBModel or XGBModel.
Pass a LGBModel or XGBModel to compute features selection with RFE.
The gradient boosting instance with the best features is selected.
When a eval_set is provided, the best gradient boosting and the best
features are obtained evaluating the score with eval_metric.
Otherwise, the best combination is obtained looking only at feature
importance.
If param_grid is a dictionary with parameter boundaries, a hyperparameter
tuning is computed simultaneously. The parameter combinations are scored on
the provided eval_set.
To operate random search pass distributions in the param_grid with rvs
method for sampling (such as those from scipy.stats.distributions).
To operate bayesian search pass hyperopt distributions.
The specification of n_iter or sampling_seed is effective only with random
or hyperopt searches.
The best parameter combination is the one which obtain the better score
(as returned by eval_metric) on the provided eval_set.
If all parameters are presented as a list/floats/integers, grid-search
is performed. If at least one parameter is given as a distribution (such as
those from scipy.stats.distributions), random-search is performed computing
sampling with replacement. Bayesian search is effective only when all the
parameters to tune are in form of hyperopt distributions.
It is highly recommended to use continuous distributions for continuous
parameters.
Parameters
----------
estimator : object
A supervised learning estimator of LGBModel or XGBModel type.
step : int or float, default=1
If greater than or equal to 1, then `step` corresponds to the
(integer) number of features to remove at each iteration.
If within (0.0, 1.0), then `step` corresponds to the percentage
(rounded down) of features to remove at each iteration.
Note that the last iteration may remove fewer than `step` features in
order to reach `min_features_to_select`.
min_features_to_select : int, default=None
The minimum number of features to be selected. This number of features
will always be scored, even if the difference between the original
feature count and `min_features_to_select` isn't divisible by
`step`. The default value for min_features_to_select is set to 1 when a
eval_set is provided, otherwise it always corresponds to n_features // 2.
importance_type : str, default='feature_importances'
Which importance measure to use. It can be 'feature_importances'
(the default feature importance of the gradient boosting estimator)
or 'shap_importances'.
train_importance : bool, default=True
Effective only when importance_type='shap_importances'.
Where to compute the shap feature importance: on train (True)
or on eval_set (False).
param_grid : dict, default=None
Dictionary with parameters names (`str`) as keys and distributions
or lists of parameters to try.
None means no hyperparameters search.
greater_is_better : bool, default=False
Effective only when hyperparameters searching.
Whether the quantity to monitor is a score function,
meaning high is good, or a loss function, meaning low is good.
n_iter : int, default=None
Effective only when hyperparameters searching.
Effective only for random or hyperopt serach.
Number of parameter settings that are sampled.
n_iter trades off runtime vs quality of the solution.
sampling_seed : int, default=None
Effective only when hyperparameters searching.
Effective only for random or hyperopt serach.
The seed used to sample from the hyperparameter distributions.
n_jobs : int, default=None
Effective only when hyperparameters searching without hyperopt.
The number of jobs to run in parallel for model fitting.
``None`` means 1 using one processor. ``-1`` means using all
processors.
verbose : int, default=1
Verbosity mode. <=0 silent all; ==1 print trial logs (when
hyperparameters searching); >1 print feature selection logs plus
trial logs (when hyperparameters searching).
Attributes
----------
estimator_ : estimator
The fitted estimator with the select features and the optimal
parameter combination (when hyperparameters searching).
n_features_ : int
The number of selected features (from the best param config
when hyperparameters searching).
ranking_ : ndarray of shape (n_features,)
The feature ranking, such that ``ranking_[i]`` corresponds to the
ranking position of the i-th feature (from the best param config
when hyperparameters searching). Selected features are assigned
rank 1.
support_ : ndarray of shape (n_features,)
The mask of selected features (from the best param config
when hyperparameters searching).
score_history_ : list
Available only when a eval_set is provided.
Scores obtained reducing the features (from the best param config
when hyperparameters searching).
best_params_ : dict
Available only when hyperparameters searching.
Parameter setting that gave the best results on the eval_set.
trials_ : list
Available only when hyperparameters searching.
A list of dicts. The dicts are all the parameter combinations tried
and derived from the param_grid.
best_score_ : float
Available only when hyperparameters searching.
The best score achieved by all the possible combination created.
scores_ : list
Available only when hyperparameters searching.
The scores achieved on the eval_set by all the models tried.
best_iter_ : int
Available only when hyperparameters searching.
The boosting iterations achieved by the best parameters combination.
iterations_ : list
Available only when hyperparameters searching.
The boosting iterations of all the models tried.
boost_type_ : str
The type of the boosting estimator (LGB or XGB).
"""
def __init__(self,
estimator, *,
min_features_to_select=None,
step=1,
param_grid=None,
greater_is_better=False,
importance_type='feature_importances',
train_importance=True,
n_iter=None,
sampling_seed=None,
verbose=1,
n_jobs=None):
self.estimator = estimator
self.min_features_to_select = min_features_to_select
self.step = step
self.param_grid = param_grid
self.greater_is_better = greater_is_better
self.importance_type = importance_type
self.train_importance = train_importance
self.n_iter = n_iter
self.sampling_seed = sampling_seed
self.verbose = verbose
self.n_jobs = n_jobs
def _build_model(self, params=None):
"""Private method to build model."""
estimator = clone(self.estimator)
if params is None:
model = _RFE(
estimator=estimator,
min_features_to_select=self.min_features_to_select,
step=self.step,
greater_is_better=self.greater_is_better,
importance_type=self.importance_type,
train_importance=self.train_importance,
verbose=self.verbose
)
else:
estimator.set_params(**params)
model = _RFE(
estimator=estimator,
min_features_to_select=self.min_features_to_select,
step=self.step,
greater_is_better=self.greater_is_better,
importance_type=self.importance_type,
train_importance=self.train_importance,
verbose=self.verbose
)
return model
class BoostRFA(_BoostSearch, _RFA):
"""Simultaneous features selection with RFA and hyperparamater searching
on a given validation set for LGBModel or XGBModel.
Pass a LGBModel or XGBModel to compute features selection with RFA.
The gradient boosting instance with the best features is selected.
When a eval_set is provided, the best gradient boosting and the best
features are obtained evaluating the score with eval_metric.
Otherwise, the best combination is obtained looking only at feature
importance.
If param_grid is a dictionary with parameter boundaries, a hyperparameter
tuning is computed simultaneously. The parameter combinations are scored on
the provided eval_set.
To operate random search pass distributions in the param_grid with rvs
method for sampling (such as those from scipy.stats.distributions).
To operate bayesian search pass hyperopt distributions.
The specification of n_iter or sampling_seed is effective only with random
or hyperopt searches.
The best parameter combination is the one which obtain the better score
(as returned by eval_metric) on the provided eval_set.
If all parameters are presented as a list/floats/integers, grid-search
is performed. If at least one parameter is given as a distribution (such as
those from scipy.stats.distributions), random-search is performed computing
sampling with replacement. Bayesian search is effective only when all the
parameters to tune are in form of hyperopt distributions.
It is highly recommended to use continuous distributions for continuous
parameters.
Parameters
----------
estimator : object
A supervised learning estimator of LGBModel or XGBModel type.
step : int or float, default=1
If greater than or equal to 1, then `step` corresponds to the
(integer) number of features to remove at each iteration.
If within (0.0, 1.0), then `step` corresponds to the percentage
(rounded down) of features to remove at each iteration.
Note that the last iteration may remove fewer than `step` features in
order to reach `min_features_to_select`.
min_features_to_select : int, default=None
The minimum number of features to be selected. This number of features
will always be scored, even if the difference between the original
feature count and `min_features_to_select` isn't divisible by
`step`. The default value for min_features_to_select is set to 1 when a
eval_set is provided, otherwise it always corresponds to n_features // 2.
importance_type : str, default='feature_importances'
Which importance measure to use. It can be 'feature_importances'
(the default feature importance of the gradient boosting estimator)
or 'shap_importances'.
train_importance : bool, default=True
Effective only when importance_type='shap_importances'.
Where to compute the shap feature importance: on train (True)
or on eval_set (False).
param_grid : dict, default=None
Dictionary with parameters names (`str`) as keys and distributions
or lists of parameters to try.
None means no hyperparameters search.
greater_is_better : bool, default=False
Effective only when hyperparameters searching.
Whether the quantity to monitor is a score function,
meaning high is good, or a loss function, meaning low is good.
n_iter : int, default=None
Effective only when hyperparameters searching.
Effective only for random or hyperopt serach.
Number of parameter settings that are sampled.
n_iter trades off runtime vs quality of the solution.
sampling_seed : int, default=None
Effective only when hyperparameters searching.
Effective only for random or hyperopt serach.
The seed used to sample from the hyperparameter distributions.
n_jobs : int, default=None
Effective only when hyperparameters searching without hyperopt.
The number of jobs to run in parallel for model fitting.
``None`` means 1 using one processor. ``-1`` means using all
processors.
verbose : int, default=1
Verbosity mode. <=0 silent all; ==1 print trial logs (when
hyperparameters searching); >1 print feature selection logs plus
trial logs (when hyperparameters searching).
Attributes
----------
estimator_ : estimator
The fitted estimator with the select features and the optimal
parameter combination (when hyperparameters searching).
n_features_ : int
The number of selected features (from the best param config
when hyperparameters searching).
ranking_ : ndarray of shape (n_features,)
The feature ranking, such that ``ranking_[i]`` corresponds to the
ranking position of the i-th feature (from the best param config
when hyperparameters searching). Selected features are assigned
rank 1.
support_ : ndarray of shape (n_features,)
The mask of selected features (from the best param config
when hyperparameters searching).
score_history_ : list
Available only when a eval_set is provided.
Scores obtained reducing the features (from the best param config
when hyperparameters searching).
best_params_ : dict
Available only when hyperparameters searching.
Parameter setting that gave the best results on the eval_set.
trials_ : list
Available only when hyperparameters searching.
A list of dicts. The dicts are all the parameter combinations tried
and derived from the param_grid.
best_score_ : float
Available only when hyperparameters searching.
The best score achieved by all the possible combination created.
scores_ : list
Available only when hyperparameters searching.
The scores achieved on the eval_set by all the models tried.
best_iter_ : int
Available only when hyperparameters searching.
The boosting
gitextract_liu6w4f0/
├── .gitignore
├── LICENSE
├── README.md
├── notebooks/
│ ├── LGBM_usage.ipynb
│ └── XGBoost_usage.ipynb
├── requirements.txt
├── setup.py
└── shaphypetune/
├── __init__.py
├── _classes.py
├── shaphypetune.py
└── utils.py
SYMBOL INDEX (51 symbols across 3 files)
FILE: shaphypetune/_classes.py
class _BoostSearch (line 20) | class _BoostSearch(BaseEstimator):
method __init__ (line 27) | def __init__(self):
method _validate_param_grid (line 30) | def _validate_param_grid(self, fit_params):
method _fit (line 62) | def _fit(self, X, y, fit_params, params=None):
method fit (line 134) | def fit(self, X, y, trials=None, **fit_params):
method predict (line 220) | def predict(self, X, **predict_params):
method predict_proba (line 243) | def predict_proba(self, X, **predict_params):
method score (line 269) | def score(self, X, y, sample_weight=None):
class _BoostSelector (line 297) | class _BoostSelector(BaseEstimator, TransformerMixin):
method __init__ (line 304) | def __init__(self):
method transform (line 307) | def transform(self, X):
method get_support (line 339) | def get_support(self, indices=False):
class _Boruta (line 365) | class _Boruta(_BoostSelector):
method __init__ (line 377) | def __init__(self,
method _create_X (line 396) | def _create_X(self, X, feat_id_real):
method _check_fit_params (line 419) | def _check_fit_params(self, fit_params, feat_id_real=None):
method _do_tests (line 451) | def _do_tests(self, dec_reg, hit_reg, iter_id):
method fit (line 475) | def fit(self, X, y, **fit_params):
class _RFE (line 622) | class _RFE(_BoostSelector):
method __init__ (line 629) | def __init__(self,
method _check_fit_params (line 646) | def _check_fit_params(self, fit_params):
method _step_score (line 672) | def _step_score(self, estimator):
method fit (line 691) | def fit(self, X, y, **fit_params):
class _RFA (line 804) | class _RFA(_BoostSelector):
method __init__ (line 811) | def __init__(self,
method _check_fit_params (line 828) | def _check_fit_params(self, fit_params, inverse=False):
method _step_score (line 854) | def _step_score(self, estimator):
method fit (line 873) | def fit(self, X, y, **fit_params):
method _transform (line 1002) | def _transform(self, X, inverse=False):
method transform (line 1033) | def transform(self, X):
FILE: shaphypetune/shaphypetune.py
class BoostSearch (line 6) | class BoostSearch(_BoostSearch):
method __init__ (line 89) | def __init__(self,
method _build_model (line 105) | def _build_model(self, params):
class BoostBoruta (line 114) | class BoostBoruta(_BoostSearch, _Boruta):
method __init__ (line 266) | def __init__(self,
method _build_model (line 295) | def _build_model(self, params=None):
class BoostRFE (line 328) | class BoostRFE(_BoostSearch, _RFE):
method __init__ (line 474) | def __init__(self,
method _build_model (line 499) | def _build_model(self, params=None):
class BoostRFA (line 530) | class BoostRFA(_BoostSearch, _RFA):
method __init__ (line 681) | def __init__(self,
method _build_model (line 706) | def _build_model(self, params=None):
FILE: shaphypetune/utils.py
function _check_boosting (line 8) | def _check_boosting(model):
function _shap_importances (line 27) | def _shap_importances(model, X):
function _feature_importances (line 49) | def _feature_importances(model):
function _get_categorical_support (line 65) | def _get_categorical_support(n_features, fit_params):
function _set_categorical_indexes (line 87) | def _set_categorical_indexes(support, cat_support, _fit_params,
function _check_param (line 109) | def _check_param(values):
class ParameterSampler (line 127) | class ParameterSampler(object):
method __init__ (line 161) | def __init__(self, param_distributions, n_iter=None, random_state=None):
method sample (line 167) | def sample(self):
Condensed preview — 11 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (219K chars).
[
{
"path": ".gitignore",
"chars": 1290,
"preview": ".DS_Store\n\n# Created by https://www.gitignore.io/api/python\n\n### Python ###\n# Byte-compiled / optimized / DLL files\n__py"
},
{
"path": "LICENSE",
"chars": 1071,
"preview": "MIT License\n\nCopyright (c) 2021 Marco Cerliani\n\nPermission is hereby granted, free of charge, to any person obtaining a "
},
{
"path": "README.md",
"chars": 6232,
"preview": "# shap-hypetune\nA python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Mo"
},
{
"path": "notebooks/LGBM_usage.ipynb",
"chars": 52891,
"preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
},
{
"path": "notebooks/XGBoost_usage.ipynb",
"chars": 69600,
"preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
},
{
"path": "requirements.txt",
"chars": 61,
"preview": "numpy\nscipy\nscikit-learn>=0.24.1\nshap>=0.39.0\nhyperopt==0.2.5"
},
{
"path": "setup.py",
"chars": 984,
"preview": "import pathlib\nfrom setuptools import setup, find_packages\n\nHERE = pathlib.Path(__file__).parent\n\nVERSION = '0.2.7'\nPACK"
},
{
"path": "shaphypetune/__init__.py",
"chars": 72,
"preview": "from .utils import *\nfrom ._classes import *\nfrom .shaphypetune import *"
},
{
"path": "shaphypetune/_classes.py",
"chars": 39541,
"preview": "import io\nimport contextlib\nimport warnings\nimport numpy as np\nimport scipy as sp\nfrom copy import deepcopy\n\nfrom sklear"
},
{
"path": "shaphypetune/shaphypetune.py",
"chars": 29299,
"preview": "from sklearn.base import clone\n\nfrom ._classes import _BoostSearch, _Boruta, _RFA, _RFE\n\n\nclass BoostSearch(_BoostSearch"
},
{
"path": "shaphypetune/utils.py",
"chars": 7421,
"preview": "import random\nimport numpy as np\nfrom itertools import product\n\nfrom shap import TreeExplainer\n\n\ndef _check_boosting(mod"
}
]
About this extraction
This page contains the full source code of the cerlymarco/shap-hypetune GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 11 files (203.6 KB), approximately 56.4k tokens, and a symbol index with 51 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.