Repository: cerlymarco/shap-hypetune
Branch: main
Commit: 8f46e161d27a
Files: 11
Total size: 203.6 KB

Directory structure:
gitextract_liu6w4f0/
├── .gitignore
├── LICENSE
├── README.md
├── notebooks/
│   ├── LGBM_usage.ipynb
│   └── XGBoost_usage.ipynb
├── requirements.txt
├── setup.py
└── shaphypetune/
    ├── __init__.py
    ├── _classes.py
    ├── shaphypetune.py
    └── utils.py

================================================ FILE CONTENTS ================================================

================================================ FILE: .gitignore ================================================
.DS_Store

# Created by https://www.gitignore.io/api/python

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# End of https://www.gitignore.io/api/python

================================================ FILE: LICENSE ================================================
MIT License

Copyright (c) 2021 Marco Cerliani

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
================================================ FILE: README.md ================================================
# shap-hypetune

A Python package for simultaneous Hyperparameter Tuning and Feature Selection for Gradient Boosting Models.

![shap-hypetune diagram](https://raw.githubusercontent.com/cerlymarco/shap-hypetune/master/imgs/shap-hypetune-diagram.png#center)

## Overview

Hyperparameter tuning and feature selection are two common steps in every machine learning pipeline. Most of the time they are computed separately and independently, which may result in suboptimal performance and a more time-consuming process. shap-hypetune aims to combine hyperparameter tuning and feature selection in a single pipeline, optimizing the number of selected features while searching for the optimal parameter configuration. Hyperparameter tuning or feature selection can also be carried out as standalone operations.

**shap-hypetune main features:**

- designed for gradient boosting models, such as LGBMModel or XGBModel;
- developed to integrate with the scikit-learn ecosystem;
- effective in both classification and regression tasks;
- customizable training process, supporting early stopping and all the other fitting options available in the standard algorithms' API;
- ranking feature selection algorithms: Recursive Feature Elimination (RFE), Recursive Feature Addition (RFA), or Boruta;
- classical boosting-based feature importances or SHAP feature importances (the latter can also be computed on the eval_set);
- apply grid-search, random-search, or bayesian-search (from hyperopt);
- parallelized computations with joblib.

## Installation

```shell
pip install --upgrade shap-hypetune
```

lightgbm and xgboost are not required dependencies: the module depends only on NumPy, shap, scikit-learn and hyperopt. Python 3.6 or above is supported.
## Media

- [SHAP for Feature Selection and HyperParameter Tuning](https://towardsdatascience.com/shap-for-feature-selection-and-hyperparameter-tuning-a330ec0ea104)
- [Boruta and SHAP for better Feature Selection](https://towardsdatascience.com/boruta-and-shap-for-better-feature-selection-20ea97595f4a)
- [Recursive Feature Selection: Addition or Elimination?](https://towardsdatascience.com/recursive-feature-selection-addition-or-elimination-755e5d86a791)
- [Boruta SHAP for Temporal Feature Selection](https://towardsdatascience.com/boruta-shap-for-temporal-feature-selection-96a7840c7713)

## Usage

```python
from shaphypetune import BoostSearch, BoostRFE, BoostRFA, BoostBoruta
```

#### Hyperparameters Tuning

```python
BoostSearch(
    estimator,                # LGBMModel or XGBModel
    param_grid=None,          # parameters to be optimized
    greater_is_better=False,  # minimize or maximize the monitored score
    n_iter=None,              # number of sampled parameter configurations
    sampling_seed=None,       # the seed used for parameter sampling
    verbose=1,                # verbosity mode
    n_jobs=None               # number of jobs to run in parallel
)
```

#### Feature Selection (RFE)

```python
BoostRFE(
    estimator,                              # LGBMModel or XGBModel
    min_features_to_select=None,            # the minimum number of features to be selected
    step=1,                                 # number of features to remove at each iteration
    param_grid=None,                        # parameters to be optimized
    greater_is_better=False,                # minimize or maximize the monitored score
    importance_type='feature_importances',  # which importance measure to use: default or shap
    train_importance=True,                  # whether to compute the shap feature importance on the train set or on the eval_set
    n_iter=None,                            # number of sampled parameter configurations
    sampling_seed=None,                     # the seed used for parameter sampling
    verbose=1,                              # verbosity mode
    n_jobs=None                             # number of jobs to run in parallel
)
```

#### Feature Selection (BORUTA)

```python
BoostBoruta(
    estimator,                              # LGBMModel or XGBModel
    perc=100,                               # threshold used to compare shadow and real features
    alpha=0.05,                             # p-value level for feature rejection
    max_iter=100,                           # maximum Boruta iterations to perform
    early_stopping_boruta_rounds=None,      # maximum iterations without confirming a feature
    param_grid=None,                        # parameters to be optimized
    greater_is_better=False,                # minimize or maximize the monitored score
    importance_type='feature_importances',  # which importance measure to use: default or shap
    train_importance=True,                  # whether to compute the shap feature importance on the train set or on the eval_set
    n_iter=None,                            # number of sampled parameter configurations
    sampling_seed=None,                     # the seed used for parameter sampling
    verbose=1,                              # verbosity mode
    n_jobs=None                             # number of jobs to run in parallel
)
```

#### Feature Selection (RFA)

```python
BoostRFA(
    estimator,                              # LGBMModel or XGBModel
    min_features_to_select=None,            # the minimum number of features to be selected
    step=1,                                 # number of features to add at each iteration
    param_grid=None,                        # parameters to be optimized
    greater_is_better=False,                # minimize or maximize the monitored score
    importance_type='feature_importances',  # which importance measure to use: default or shap
    train_importance=True,                  # whether to compute the shap feature importance on the train set or on the eval_set
    n_iter=None,                            # number of sampled parameter configurations
    sampling_seed=None,                     # the seed used for parameter sampling
    verbose=1,                              # verbosity mode
    n_jobs=None                             # number of jobs to run in parallel
)
```

Full examples in the [notebooks folder](https://github.com/cerlymarco/shap-hypetune/tree/main/notebooks).
================================================ FILE: notebooks/LGBM_usage.ipynb ================================================ {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification, make_regression\n\nfrom hyperopt import hp\nfrom hyperopt import Trials\n\nfrom lightgbm import *\n\ntry:\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\nexcept:\n !pip install --upgrade shap-hypetune\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\n\nimport warnings\nwarnings.simplefilter('ignore')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:43.363945Z","iopub.execute_input":"2022-01-01T11:46:43.364356Z","iopub.status.idle":"2022-01-01T11:46:45.084134Z","shell.execute_reply.started":"2022-01-01T11:46:43.364243Z","shell.execute_reply":"2022-01-01T11:46:45.083177Z"},"trusted":true},"execution_count":1,"outputs":[{"output_type":"display_data","data":{"text/plain":"","text/html":"\n"},"metadata":{}}]},{"cell_type":"code","source":"X_clf, y_clf = make_classification(n_samples=6000, n_features=20, n_classes=2, \n n_informative=4, n_redundant=6, random_state=0)\n\nX_clf_train, X_clf_valid, y_clf_train, y_clf_valid = train_test_split(\n X_clf, y_clf, test_size=0.3, shuffle=False)\n\nX_regr, y_regr = make_classification(n_samples=6000, n_features=20,\n n_informative=7, random_state=0)\n\nX_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(\n X_regr, y_regr, test_size=0.3, 
shuffle=False)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.086875Z","iopub.execute_input":"2022-01-01T11:46:45.087123Z","iopub.status.idle":"2022-01-01T11:46:45.118700Z","shell.execute_reply.started":"2022-01-01T11:46:45.087094Z","shell.execute_reply":"2022-01-01T11:46:45.117983Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"param_grid = {\n 'learning_rate': [0.2, 0.1],\n 'num_leaves': [25, 35],\n 'max_depth': [10, 12]\n}\n\nparam_dist = {\n 'learning_rate': stats.uniform(0.09, 0.25),\n 'num_leaves': stats.randint(20,40),\n 'max_depth': [10, 12]\n}\n\nparam_dist_hyperopt = {\n 'max_depth': 15 + hp.randint('num_leaves', 5), \n 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n 'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n}\n\n\nregr_lgbm = LGBMRegressor(n_estimators=150, random_state=0, n_jobs=-1)\nclf_lgbm = LGBMClassifier(n_estimators=150, random_state=0, n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.120073Z","iopub.execute_input":"2022-01-01T11:46:45.120376Z","iopub.status.idle":"2022-01-01T11:46:45.132838Z","shell.execute_reply.started":"2022-01-01T11:46:45.120336Z","shell.execute_reply":"2022-01-01T11:46:45.131615Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Hyperparameters Tuning","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH ###\n\nmodel = BoostSearch(clf_lgbm, param_grid=param_grid)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.134450Z","iopub.execute_input":"2022-01-01T11:46:45.135435Z","iopub.status.idle":"2022-01-01T11:46:46.383589Z","shell.execute_reply.started":"2022-01-01T11:46:45.135389Z","shell.execute_reply":"2022-01-01T11:46:46.382860Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.2085\ntrial: 0002 ### iterations: 00019 ### eval_score: 0.21112\ntrial: 0003 ### iterations: 00026 ### eval_score: 0.21162\ntrial: 0004 ### iterations: 00032 ### eval_score: 0.20747\ntrial: 0005 ### iterations: 00054 ### eval_score: 0.20244\ntrial: 0006 ### iterations: 00071 ### eval_score: 0.20052\ntrial: 0007 ### iterations: 00047 ### eval_score: 0.20306\ntrial: 0008 ### iterations: 00050 ### eval_score: 0.20506\n","output_type":"stream"},{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.388010Z","iopub.execute_input":"2022-01-01T11:46:46.389926Z","iopub.status.idle":"2022-01-01T11:46:46.397550Z","shell.execute_reply.started":"2022-01-01T11:46:46.389888Z","shell.execute_reply":"2022-01-01T11:46:46.396658Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=25, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 12},\n 0.20051586840398297)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n 
model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.398765Z","iopub.execute_input":"2022-01-01T11:46:46.399534Z","iopub.status.idle":"2022-01-01T11:46:46.436761Z","shell.execute_reply.started":"2022-01-01T11:46:46.399498Z","shell.execute_reply":"2022-01-01T11:46:46.431623Z"},"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"(0.9183333333333333, (1800,), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH ###\n\nmodel = BoostSearch(\n regr_lgbm, param_grid=param_dist,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.438241Z","iopub.execute_input":"2022-01-01T11:46:46.438923Z","iopub.status.idle":"2022-01-01T11:46:47.128794Z","shell.execute_reply.started":"2022-01-01T11:46:46.438892Z","shell.execute_reply":"2022-01-01T11:46:47.128107Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.07643\ntrial: 0002 ### iterations: 00052 ### eval_score: 0.06818\ntrial: 0003 ### iterations: 00062 ### eval_score: 0.07042\ntrial: 0004 ### iterations: 00033 ### eval_score: 0.07035\ntrial: 0005 ### iterations: 00032 ### eval_score: 0.07153\ntrial: 0006 ### iterations: 00012 ### eval_score: 0.07547\ntrial: 0007 ### iterations: 00041 ### eval_score: 0.07355\ntrial: 0008 ### iterations: 00025 ### eval_score: 0.07805\n","output_type":"stream"},{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMRegressor(n_estimators=150, random_state=0), n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n 
sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.132071Z","iopub.execute_input":"2022-01-01T11:46:47.132611Z","iopub.status.idle":"2022-01-01T11:46:47.142185Z","shell.execute_reply.started":"2022-01-01T11:46:47.132575Z","shell.execute_reply":"2022-01-01T11:46:47.141271Z"},"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.1350674222191923, max_depth=10, n_estimators=150,\n num_leaves=38, random_state=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.06817737242646997)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.143613Z","iopub.execute_input":"2022-01-01T11:46:47.143856Z","iopub.status.idle":"2022-01-01T11:46:47.611056Z","shell.execute_reply.started":"2022-01-01T11:46:47.143827Z","shell.execute_reply":"2022-01-01T11:46:47.610379Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"(0.7272820930747703, (1800,), (1800, 21))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT ###\n\nmodel = BoostSearch(\n regr_lgbm, param_grid=param_dist_hyperopt,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.614530Z","iopub.execute_input":"2022-01-01T11:46:47.616779Z","iopub.status.idle":"2022-01-01T11:46:49.268236Z","shell.execute_reply.started":"2022-01-01T11:46:47.616738Z","shell.execute_reply":"2022-01-01T11:46:49.267608Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.06979\ntrial: 0002 ### iterations: 00055 ### eval_score: 0.07039\ntrial: 0003 ### iterations: 00056 ### eval_score: 0.0716\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.07352\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07936\ntrial: 0006 ### iterations: 00147 ### eval_score: 0.06833\ntrial: 0007 ### iterations: 00032 ### eval_score: 0.07261\ntrial: 0008 ### iterations: 00096 ### eval_score: 0.07074\n","output_type":"stream"},{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMRegressor(n_estimators=150, random_state=0), n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:49.271739Z","iopub.execute_input":"2022-01-01T11:46:49.272301Z","iopub.status.idle":"2022-01-01T11:46:49.279337Z","shell.execute_reply.started":"2022-01-01T11:46:49.272264Z","shell.execute_reply":"2022-01-01T11:46:49.278727Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.7597292534356749,\n learning_rate=0.059836658149176665, max_depth=16,\n n_estimators=150, random_state=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 
0.06832542425080958)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:49.280499Z","iopub.execute_input":"2022-01-01T11:46:49.280735Z","iopub.status.idle":"2022-01-01T11:46:50.260345Z","shell.execute_reply.started":"2022-01-01T11:46:49.280700Z","shell.execute_reply":"2022-01-01T11:46:50.259694Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"(0.7266898674988451, (1800,), (1800, 21))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection","metadata":{}},{"cell_type":"code","source":"### BORUTA ###\n\nmodel = BoostBoruta(clf_lgbm, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:50.263726Z","iopub.execute_input":"2022-01-01T11:46:50.265917Z","iopub.status.idle":"2022-01-01T11:46:56.714012Z","shell.execute_reply.started":"2022-01-01T11:46:50.265869Z","shell.execute_reply":"2022-01-01T11:46:56.713278Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n max_iter=200)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.720017Z","iopub.execute_input":"2022-01-01T11:46:56.720486Z","iopub.status.idle":"2022-01-01T11:46:56.727782Z","shell.execute_reply.started":"2022-01-01T11:46:56.720450Z","shell.execute_reply":"2022-01-01T11:46:56.726815Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(n_estimators=150, random_state=0), 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.730004Z","iopub.execute_input":"2022-01-01T11:46:56.730326Z","iopub.status.idle":"2022-01-01T11:46:56.765852Z","shell.execute_reply.started":"2022-01-01T11:46:56.730286Z","shell.execute_reply":"2022-01-01T11:46:56.760625Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"(0.91, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(regr_lgbm, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.767160Z","iopub.execute_input":"2022-01-01T11:46:56.767432Z","iopub.status.idle":"2022-01-01T11:46:59.411924Z","shell.execute_reply.started":"2022-01-01T11:46:56.767401Z","shell.execute_reply":"2022-01-01T11:46:59.411240Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:59.415300Z","iopub.execute_input":"2022-01-01T11:46:59.417330Z","iopub.status.idle":"2022-01-01T11:46:59.424201Z","shell.execute_reply.started":"2022-01-01T11:46:59.417288Z","shell.execute_reply":"2022-01-01T11:46:59.423561Z"},"trusted":true},"execution_count":17,"outputs":[{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:59.425449Z","iopub.execute_input":"2022-01-01T11:46:59.425674Z","iopub.status.idle":"2022-01-01T11:47:00.248420Z","shell.execute_reply.started":"2022-01-01T11:46:59.425645Z","shell.execute_reply":"2022-01-01T11:47:00.247703Z"},"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"(0.7766363424352807, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(regr_lgbm, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:00.251993Z","iopub.execute_input":"2022-01-01T11:47:00.252510Z","iopub.status.idle":"2022-01-01T11:47:03.954790Z","shell.execute_reply.started":"2022-01-01T11:47:00.252473Z","shell.execute_reply":"2022-01-01T11:47:03.954052Z"},"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n 
min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:03.958397Z","iopub.execute_input":"2022-01-01T11:47:03.958982Z","iopub.status.idle":"2022-01-01T11:47:03.967715Z","shell.execute_reply.started":"2022-01-01T11:47:03.958931Z","shell.execute_reply":"2022-01-01T11:47:03.966909Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:03.969215Z","iopub.execute_input":"2022-01-01T11:47:03.969612Z","iopub.status.idle":"2022-01-01T11:47:04.838820Z","shell.execute_reply.started":"2022-01-01T11:47:03.969569Z","shell.execute_reply":"2022-01-01T11:47:04.838192Z"},"trusted":true},"execution_count":21,"outputs":[{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"(0.7723191919698336, (1800,), (1800, 8), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_lgbm, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:04.842289Z","iopub.execute_input":"2022-01-01T11:47:04.844564Z","iopub.status.idle":"2022-01-01T11:47:17.780389Z","shell.execute_reply.started":"2022-01-01T11:47:04.844522Z","shell.execute_reply":"2022-01-01T11:47:17.779726Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n importance_type='shap_importances', max_iter=200,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.781535Z","iopub.execute_input":"2022-01-01T11:47:17.784569Z","iopub.status.idle":"2022-01-01T11:47:17.791371Z","shell.execute_reply.started":"2022-01-01T11:47:17.784530Z","shell.execute_reply":"2022-01-01T11:47:17.790591Z"},"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(n_estimators=150, random_state=0), 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.794450Z","iopub.execute_input":"2022-01-01T11:47:17.794986Z","iopub.status.idle":"2022-01-01T11:47:17.813842Z","shell.execute_reply.started":"2022-01-01T11:47:17.794933Z","shell.execute_reply":"2022-01-01T11:47:17.813126Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"(0.9111111111111111, (1800,), (1800, 9), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_lgbm, min_features_to_select=1, step=1,\n 
importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.817477Z","iopub.execute_input":"2022-01-01T11:47:17.819641Z","iopub.status.idle":"2022-01-01T11:47:32.735329Z","shell.execute_reply.started":"2022-01-01T11:47:17.819595Z","shell.execute_reply":"2022-01-01T11:47:32.734687Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:32.736646Z","iopub.execute_input":"2022-01-01T11:47:32.737109Z","iopub.status.idle":"2022-01-01T11:47:32.743398Z","shell.execute_reply.started":"2022-01-01T11:47:32.737074Z","shell.execute_reply":"2022-01-01T11:47:32.742747Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:32.744765Z","iopub.execute_input":"2022-01-01T11:47:32.747374Z","iopub.status.idle":"2022-01-01T11:47:33.570515Z","shell.execute_reply.started":"2022-01-01T11:47:32.747336Z","shell.execute_reply":"2022-01-01T11:47:33.569899Z"},"trusted":true},"execution_count":27,"outputs":[{"execution_count":27,"output_type":"execute_result","data":{"text/plain":"(0.7766363424352807, (1800,), (1800, 7), (1800, 
8))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_lgbm, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:33.571778Z","iopub.execute_input":"2022-01-01T11:47:33.572261Z","iopub.status.idle":"2022-01-01T11:47:39.941084Z","shell.execute_reply.started":"2022-01-01T11:47:33.572226Z","shell.execute_reply":"2022-01-01T11:47:39.940356Z"},"trusted":true},"execution_count":28,"outputs":[{"execution_count":28,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:39.944497Z","iopub.execute_input":"2022-01-01T11:47:39.946592Z","iopub.status.idle":"2022-01-01T11:47:39.953717Z","shell.execute_reply.started":"2022-01-01T11:47:39.946550Z","shell.execute_reply":"2022-01-01T11:47:39.952924Z"},"trusted":true},"execution_count":29,"outputs":[{"execution_count":29,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, 
pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:39.954955Z","iopub.execute_input":"2022-01-01T11:47:39.955713Z","iopub.status.idle":"2022-01-01T11:47:40.853749Z","shell.execute_reply.started":"2022-01-01T11:47:39.955669Z","shell.execute_reply":"2022-01-01T11:47:40.853100Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(0.7699366468805918, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA ###\n\nmodel = BoostBoruta(clf_lgbm, param_grid=param_grid, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:40.857000Z","iopub.execute_input":"2022-01-01T11:47:40.859123Z","iopub.status.idle":"2022-01-01T11:48:08.045782Z","shell.execute_reply.started":"2022-01-01T11:47:40.859074Z","shell.execute_reply":"2022-01-01T11:48:08.043191Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.19868\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.19844\ntrial: 0003 ### iterations: 00023 ### eval_score: 0.19695\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19949\ntrial: 0005 ### iterations: 00067 ### eval_score: 0.19583\ntrial: 0006 ### iterations: 00051 ### eval_score: 0.1949\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19675\ntrial: 0008 ### iterations: 00055 ### eval_score: 0.19906\n","output_type":"stream"},{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n max_iter=200,\n 
param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.047190Z","iopub.execute_input":"2022-01-01T11:48:08.048047Z","iopub.status.idle":"2022-01-01T11:48:08.056353Z","shell.execute_reply.started":"2022-01-01T11:48:08.048000Z","shell.execute_reply":"2022-01-01T11:48:08.055615Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=25, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 12},\n 0.19489866976777023,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.058015Z","iopub.execute_input":"2022-01-01T11:48:08.058593Z","iopub.status.idle":"2022-01-01T11:48:08.109632Z","shell.execute_reply.started":"2022-01-01T11:48:08.058410Z","shell.execute_reply":"2022-01-01T11:48:08.108670Z"},"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"(0.915, (1800,), (1800, 9), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(\n regr_lgbm, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.114460Z","iopub.execute_input":"2022-01-01T11:48:08.116626Z","iopub.status.idle":"2022-01-01T11:48:20.506235Z","shell.execute_reply.started":"2022-01-01T11:48:08.116579Z","shell.execute_reply":"2022-01-01T11:48:20.505511Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00107 ### eval_score: 0.06016\ntrial: 0002 ### iterations: 00095 ### eval_score: 0.05711\ntrial: 0003 ### iterations: 00121 ### eval_score: 0.05926\ntrial: 0004 ### iterations: 00103 ### eval_score: 0.05688\ntrial: 0005 ### iterations: 00119 ### eval_score: 0.05618\ntrial: 0006 ### iterations: 00049 ### eval_score: 0.06188\ntrial: 0007 ### iterations: 00150 ### eval_score: 0.05538\ntrial: 0008 ### iterations: 00083 ### eval_score: 0.06084\n","output_type":"stream"},{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:20.509788Z","iopub.execute_input":"2022-01-01T11:48:20.511633Z","iopub.status.idle":"2022-01-01T11:48:20.521139Z","shell.execute_reply.started":"2022-01-01T11:48:20.511592Z","shell.execute_reply":"2022-01-01T11:48:20.520293Z"},"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.13639381870463482, max_depth=12, n_estimators=150,\n num_leaves=25, random_state=0),\n {'learning_rate': 0.13639381870463482, 'num_leaves': 25, 'max_depth': 12},\n 0.0553821617278472,\n 
7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:20.522443Z","iopub.execute_input":"2022-01-01T11:48:20.522817Z","iopub.status.idle":"2022-01-01T11:48:21.145683Z","shell.execute_reply.started":"2022-01-01T11:48:20.522785Z","shell.execute_reply":"2022-01-01T11:48:21.145033Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"(0.7784645155736596, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(\n regr_lgbm, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:21.149492Z","iopub.execute_input":"2022-01-01T11:48:21.151302Z","iopub.status.idle":"2022-01-01T11:48:56.679453Z","shell.execute_reply.started":"2022-01-01T11:48:21.151261Z","shell.execute_reply":"2022-01-01T11:48:56.678720Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00150 ### eval_score: 0.06507\ntrial: 0002 ### iterations: 00075 ### eval_score: 0.05784\ntrial: 0003 ### iterations: 00095 ### eval_score: 0.06088\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.06976\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07593\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.05995\ntrial: 0007 ### iterations: 00058 ### eval_score: 0.05916\ntrial: 0008 ### iterations: 00150 ### eval_score: 
0.06366\n","output_type":"stream"},{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:56.682847Z","iopub.execute_input":"2022-01-01T11:48:56.684405Z","iopub.status.idle":"2022-01-01T11:48:56.691812Z","shell.execute_reply.started":"2022-01-01T11:48:56.684368Z","shell.execute_reply":"2022-01-01T11:48:56.690932Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.8515260655364685,\n learning_rate=0.13520045129619862, max_depth=18, n_estimators=150,\n random_state=0),\n {'colsample_bytree': 0.8515260655364685,\n 'learning_rate': 0.13520045129619862,\n 'max_depth': 18},\n 0.0578369356489881,\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:56.693078Z","iopub.execute_input":"2022-01-01T11:48:56.693305Z","iopub.status.idle":"2022-01-01T11:48:57.115924Z","shell.execute_reply.started":"2022-01-01T11:48:56.693277Z","shell.execute_reply":"2022-01-01T11:48:57.115308Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"(0.7686451168212334, (1800,), (1800, 8), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH 
GRID-SEARCH + BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_lgbm, param_grid=param_grid, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"scrolled":true,"execution":{"iopub.status.busy":"2022-01-01T11:48:57.119397Z","iopub.execute_input":"2022-01-01T11:48:57.120009Z","iopub.status.idle":"2022-01-01T11:50:15.982498Z","shell.execute_reply.started":"2022-01-01T11:48:57.119958Z","shell.execute_reply":"2022-01-01T11:50:15.981774Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00036 ### eval_score: 0.19716\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.19818\ntrial: 0003 ### iterations: 00031 ### eval_score: 0.19881\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19949\ntrial: 0005 ### iterations: 00067 ### eval_score: 0.19583\ntrial: 0006 ### iterations: 00051 ### eval_score: 0.1949\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19675\ntrial: 0008 ### iterations: 00057 ### eval_score: 0.19284\n","output_type":"stream"},{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n importance_type='shap_importances', max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]},\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:15.988196Z","iopub.execute_input":"2022-01-01T11:50:15.988729Z","iopub.status.idle":"2022-01-01T11:50:15.996898Z","shell.execute_reply.started":"2022-01-01T11:50:15.988685Z","shell.execute_reply":"2022-01-01T11:50:15.996175Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=35, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 35, 'max_depth': 12},\n 0.1928371931511303,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:15.998631Z","iopub.execute_input":"2022-01-01T11:50:15.999269Z","iopub.status.idle":"2022-01-01T11:50:16.029050Z","shell.execute_reply.started":"2022-01-01T11:50:15.999228Z","shell.execute_reply":"2022-01-01T11:50:16.028270Z"},"trusted":true},"execution_count":42,"outputs":[{"execution_count":42,"output_type":"execute_result","data":{"text/plain":"(0.9111111111111111, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_lgbm, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:16.030261Z","iopub.execute_input":"2022-01-01T11:50:16.030658Z","iopub.status.idle":"2022-01-01T11:51:19.095150Z","shell.execute_reply.started":"2022-01-01T11:50:16.030625Z","shell.execute_reply":"2022-01-01T11:51:19.094483Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00107 ### eval_score: 0.06016\ntrial: 0002 ### iterations: 00102 ### eval_score: 0.05525\ntrial: 0003 ### iterations: 00150 ### eval_score: 0.05869\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.05863\ntrial: 0005 ### iterations: 00119 ### eval_score: 0.05618\ntrial: 0006 ### iterations: 00049 ### eval_score: 0.06188\ntrial: 0007 ### iterations: 00150 ### eval_score: 0.05538\ntrial: 0008 ### iterations: 00083 ### eval_score: 0.06084\n","output_type":"stream"},{"execution_count":43,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.098487Z","iopub.execute_input":"2022-01-01T11:51:19.099062Z","iopub.status.idle":"2022-01-01T11:51:19.108772Z","shell.execute_reply.started":"2022-01-01T11:51:19.099027Z","shell.execute_reply":"2022-01-01T11:51:19.107939Z"},"trusted":true},"execution_count":44,"outputs":[{"execution_count":44,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.1350674222191923, max_depth=10, n_estimators=150,\n num_leaves=38, random_state=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 
'max_depth': 10},\n 0.05524518772497125,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.110141Z","iopub.execute_input":"2022-01-01T11:51:19.110358Z","iopub.status.idle":"2022-01-01T11:51:19.840667Z","shell.execute_reply.started":"2022-01-01T11:51:19.110333Z","shell.execute_reply":"2022-01-01T11:51:19.840035Z"},"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"(0.779012428496056, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_lgbm, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.844245Z","iopub.execute_input":"2022-01-01T11:51:19.844839Z","iopub.status.idle":"2022-01-01T11:52:27.830673Z","shell.execute_reply.started":"2022-01-01T11:51:19.844800Z","shell.execute_reply":"2022-01-01T11:52:27.829915Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00150 ### eval_score: 0.06508\ntrial: 0002 ### iterations: 00091 ### eval_score: 0.05997\ntrial: 0003 ### iterations: 00094 ### eval_score: 0.06078\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.06773\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07565\ntrial: 0006 ### iterations: 00150 ### eval_score: 
0.05935\ntrial: 0007 ### iterations: 00083 ### eval_score: 0.06047\ntrial: 0008 ### iterations: 00150 ### eval_score: 0.05966\n","output_type":"stream"},{"execution_count":46,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:27.834402Z","iopub.execute_input":"2022-01-01T11:52:27.835864Z","iopub.status.idle":"2022-01-01T11:52:27.842813Z","shell.execute_reply.started":"2022-01-01T11:52:27.835812Z","shell.execute_reply":"2022-01-01T11:52:27.842095Z"},"trusted":true},"execution_count":47,"outputs":[{"execution_count":47,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.7597292534356749,\n learning_rate=0.059836658149176665, max_depth=16,\n n_estimators=150, random_state=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.059352961644604275,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:27.844085Z","iopub.execute_input":"2022-01-01T11:52:27.844320Z","iopub.status.idle":"2022-01-01T11:52:28.690931Z","shell.execute_reply.started":"2022-01-01T11:52:27.844291Z","shell.execute_reply":"2022-01-01T11:52:28.690302Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"(0.7625808256692885, (1800,), (1800, 9), (1800, 
10))"},"metadata":{}}]},{"cell_type":"markdown","source":"# CUSTOM EVAL METRIC SUPPORT","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score\n\ndef AUC(y_true, y_hat):\n return 'auc', roc_auc_score(y_true, y_hat), True","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:28.691909Z","iopub.execute_input":"2022-01-01T11:52:28.692560Z","iopub.status.idle":"2022-01-01T11:52:28.696813Z","shell.execute_reply.started":"2022-01-01T11:52:28.692526Z","shell.execute_reply":"2022-01-01T11:52:28.696058Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"model = BoostRFE(\n LGBMClassifier(n_estimators=150, random_state=0, metric=\"custom\"), \n param_grid=param_grid, min_features_to_select=1, step=1,\n greater_is_better=True\n)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0, \n eval_metric=AUC\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:28.700234Z","iopub.execute_input":"2022-01-01T11:52:28.700461Z","iopub.status.idle":"2022-01-01T11:52:49.577997Z","shell.execute_reply.started":"2022-01-01T11:52:28.700433Z","shell.execute_reply":"2022-01-01T11:52:49.577317Z"},"trusted":true},"execution_count":50,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00028 ### eval_score: 0.97581\ntrial: 0002 ### iterations: 00016 ### eval_score: 0.97514\ntrial: 0003 ### iterations: 00015 ### eval_score: 0.97574\ntrial: 0004 ### iterations: 00032 ### eval_score: 0.97549\ntrial: 0005 ### iterations: 00075 ### eval_score: 0.97551\ntrial: 0006 ### iterations: 00041 ### eval_score: 0.97597\ntrial: 0007 ### iterations: 00076 ### eval_score: 0.97592\ntrial: 0008 ### iterations: 00060 ### eval_score: 
0.97539\n","output_type":"stream"},{"execution_count":50,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(metric='custom', n_estimators=150,\n random_state=0),\n greater_is_better=True, min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"markdown","source":"# CATEGORICAL FEATURE SUPPORT","metadata":{}},{"cell_type":"code","source":"categorical_feature = [0,1,2]\n\nX_clf_train[:,categorical_feature] = (X_clf_train[:,categorical_feature]+100).clip(0).astype(int)\nX_clf_valid[:,categorical_feature] = (X_clf_valid[:,categorical_feature]+100).clip(0).astype(int)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:49.581409Z","iopub.execute_input":"2022-01-01T11:52:49.581982Z","iopub.status.idle":"2022-01-01T11:52:49.589315Z","shell.execute_reply.started":"2022-01-01T11:52:49.581931Z","shell.execute_reply":"2022-01-01T11:52:49.588511Z"},"trusted":true},"execution_count":51,"outputs":[]},{"cell_type":"code","source":"### MANUAL PASS categorical_feature WITH NUMPY ARRAYS ###\n\nmodel = BoostRFE(clf_lgbm, param_grid=param_grid, min_features_to_select=1, step=1)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0,\n categorical_feature=categorical_feature\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:49.590366Z","iopub.execute_input":"2022-01-01T11:52:49.590604Z","iopub.status.idle":"2022-01-01T11:53:00.495917Z","shell.execute_reply.started":"2022-01-01T11:52:49.590576Z","shell.execute_reply":"2022-01-01T11:53:00.495224Z"},"trusted":true},"execution_count":52,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00029 ### eval_score: 0.2036\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.2034\ntrial: 0003 ### iterations: 00027 ### eval_score: 0.20617\ntrial: 
0004 ### iterations: 00024 ### eval_score: 0.20003\ntrial: 0005 ### iterations: 00060 ### eval_score: 0.20332\ntrial: 0006 ### iterations: 00063 ### eval_score: 0.20329\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.20136\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.19959\n","output_type":"stream"},{"execution_count":52,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"X_clf_train = pd.DataFrame(X_clf_train)\nX_clf_train[categorical_feature] = X_clf_train[categorical_feature].astype('category')\n\nX_clf_valid = pd.DataFrame(X_clf_valid)\nX_clf_valid[categorical_feature] = X_clf_valid[categorical_feature].astype('category')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:53:00.499198Z","iopub.execute_input":"2022-01-01T11:53:00.499858Z","iopub.status.idle":"2022-01-01T11:53:00.527402Z","shell.execute_reply.started":"2022-01-01T11:53:00.499814Z","shell.execute_reply":"2022-01-01T11:53:00.526779Z"},"trusted":true},"execution_count":53,"outputs":[]},{"cell_type":"code","source":"### PASS category COLUMNS IN PANDAS DF ###\n\nmodel = BoostRFE(clf_lgbm, param_grid=param_grid, min_features_to_select=1, step=1)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:53:00.529027Z","iopub.execute_input":"2022-01-01T11:53:00.529320Z","iopub.status.idle":"2022-01-01T11:53:12.422092Z","shell.execute_reply.started":"2022-01-01T11:53:00.529281Z","shell.execute_reply":"2022-01-01T11:53:12.421368Z"},"trusted":true},"execution_count":54,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00029 ### eval_score: 
0.2036\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.2034\ntrial: 0003 ### iterations: 00027 ### eval_score: 0.20617\ntrial: 0004 ### iterations: 00024 ### eval_score: 0.20003\ntrial: 0005 ### iterations: 00060 ### eval_score: 0.20332\ntrial: 0006 ### iterations: 00063 ### eval_score: 0.20329\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.20136\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.19959\n","output_type":"stream"},{"execution_count":54,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]}]} ================================================ FILE: notebooks/XGBoost_usage.ipynb ================================================ {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification, make_regression\n\nfrom hyperopt import hp\nfrom hyperopt import Trials\n\nfrom xgboost import *\n\ntry:\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\nexcept:\n !pip install --upgrade shap-hypetune\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\n\nimport 
warnings\nwarnings.simplefilter('ignore')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:44.031173Z","iopub.execute_input":"2022-01-01T11:49:44.031497Z","iopub.status.idle":"2022-01-01T11:49:45.071830Z","shell.execute_reply.started":"2022-01-01T11:49:44.031410Z","shell.execute_reply":"2022-01-01T11:49:45.070928Z"},"trusted":true},"execution_count":1,"outputs":[]},{"cell_type":"code","source":"X_clf, y_clf = make_classification(n_samples=6000, n_features=20, n_classes=2, \n n_informative=4, n_redundant=6, random_state=0)\n\nX_clf_train, X_clf_valid, y_clf_train, y_clf_valid = train_test_split(\n X_clf, y_clf, test_size=0.3, shuffle=False)\n\nX_regr, y_regr = make_classification(n_samples=6000, n_features=20,\n n_informative=7, random_state=0)\n\nX_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(\n X_regr, y_regr, test_size=0.3, shuffle=False)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.073832Z","iopub.execute_input":"2022-01-01T11:49:45.074046Z","iopub.status.idle":"2022-01-01T11:49:45.098178Z","shell.execute_reply.started":"2022-01-01T11:49:45.074004Z","shell.execute_reply":"2022-01-01T11:49:45.097461Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"param_grid = {\n 'learning_rate': [0.2, 0.1],\n 'num_leaves': [25, 35],\n 'max_depth': [10, 12]\n}\n\nparam_dist = {\n 'learning_rate': stats.uniform(0.09, 0.25),\n 'num_leaves': stats.randint(20,40),\n 'max_depth': [10, 12]\n}\n\nparam_dist_hyperopt = {\n 'max_depth': 15 + hp.randint('num_leaves', 5), \n 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n 'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n}\n\n\nregr_xgb = XGBRegressor(n_estimators=150, random_state=0, verbosity=0, n_jobs=-1)\nclf_xgb = XGBClassifier(n_estimators=150, random_state=0, verbosity=0, 
n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.099715Z","iopub.execute_input":"2022-01-01T11:49:45.099916Z","iopub.status.idle":"2022-01-01T11:49:45.108765Z","shell.execute_reply.started":"2022-01-01T11:49:45.099890Z","shell.execute_reply":"2022-01-01T11:49:45.107996Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Hyperparameters Tuning","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH ###\n\nmodel = BoostSearch(clf_xgb, param_grid=param_grid)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.109686Z","iopub.execute_input":"2022-01-01T11:49:45.109871Z","iopub.status.idle":"2022-01-01T11:49:52.490942Z","shell.execute_reply.started":"2022-01-01T11:49:45.109848Z","shell.execute_reply":"2022-01-01T11:49:52.490078Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.2045\ntrial: 0002 ### iterations: 00026 ### eval_score: 0.19472\ntrial: 0003 ### iterations: 00021 ### eval_score: 0.2045\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19472\ntrial: 0005 ### iterations: 00045 ### eval_score: 0.19964\ntrial: 0006 ### iterations: 00050 ### eval_score: 0.20157\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19964\ntrial: 0008 ### iterations: 00050 ### eval_score: 0.20157\n","output_type":"stream"},{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, 
min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.493607Z","iopub.execute_input":"2022-01-01T11:49:52.494126Z","iopub.status.idle":"2022-01-01T11:49:52.504649Z","shell.execute_reply.started":"2022-01-01T11:49:52.494081Z","shell.execute_reply":"2022-01-01T11:49:52.503849Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.2, max_delta_step=0,\n max_depth=12, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.2, 'num_leaves': 25, 'max_depth': 12},\n 0.194719)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n 
model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.506201Z","iopub.execute_input":"2022-01-01T11:49:52.506365Z","iopub.status.idle":"2022-01-01T11:49:52.528604Z","shell.execute_reply.started":"2022-01-01T11:49:52.506344Z","shell.execute_reply":"2022-01-01T11:49:52.528078Z"},"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"(0.9138888888888889, (1800,), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH ###\n\nmodel = BoostSearch(\n regr_xgb, param_grid=param_dist,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.529476Z","iopub.execute_input":"2022-01-01T11:49:52.530097Z","iopub.status.idle":"2022-01-01T11:50:03.018637Z","shell.execute_reply.started":"2022-01-01T11:49:52.530066Z","shell.execute_reply":"2022-01-01T11:50:03.017927Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00012 ### eval_score: 0.27616\ntrial: 0002 ### iterations: 00056 ### eval_score: 0.26211\ntrial: 0003 ### iterations: 00078 ### eval_score: 0.27603\ntrial: 0004 ### iterations: 00045 ### eval_score: 0.26117\ntrial: 0005 ### iterations: 00046 ### eval_score: 0.27868\ntrial: 0006 ### iterations: 00035 ### eval_score: 0.27815\ntrial: 0007 ### iterations: 00039 ### eval_score: 0.2753\ntrial: 0008 ### iterations: 00016 ### eval_score: 0.28116\n","output_type":"stream"},{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None, colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, 
importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estim...\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.019747Z","iopub.execute_input":"2022-01-01T11:50:03.020416Z","iopub.status.idle":"2022-01-01T11:50:03.030730Z","shell.execute_reply.started":"2022-01-01T11:50:03.020379Z","shell.execute_reply":"2022-01-01T11:50:03.030065Z"},"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1669837381562427,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1669837381562427, 'num_leaves': 25, 'max_depth': 10},\n 0.26117)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n 
model.predict(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.032146Z","iopub.execute_input":"2022-01-01T11:50:03.032612Z","iopub.status.idle":"2022-01-01T11:50:03.058721Z","shell.execute_reply.started":"2022-01-01T11:50:03.032572Z","shell.execute_reply":"2022-01-01T11:50:03.058084Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"(0.7271524639165458, (1800,))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT ###\n\nmodel = BoostSearch(\n regr_xgb, param_grid=param_dist_hyperopt,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.059800Z","iopub.execute_input":"2022-01-01T11:50:03.062204Z","iopub.status.idle":"2022-01-01T11:50:32.323625Z","shell.execute_reply.started":"2022-01-01T11:50:03.062158Z","shell.execute_reply":"2022-01-01T11:50:32.322789Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.27498\ntrial: 0002 ### iterations: 00074 ### eval_score: 0.27186\ntrial: 0003 ### iterations: 00038 ### eval_score: 0.28326\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.29455\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.28037\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.26421\ntrial: 0007 ### iterations: 00052 ### eval_score: 0.27191\ntrial: 0008 ### iterations: 00133 ### eval_score: 0.29251\n","output_type":"stream"},{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None, colsample_bytree=None,\n enable_categorical=False, 
gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estim...\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.324994Z","iopub.execute_input":"2022-01-01T11:50:32.325480Z","iopub.status.idle":"2022-01-01T11:50:32.335828Z","shell.execute_reply.started":"2022-01-01T11:50:32.325441Z","shell.execute_reply":"2022-01-01T11:50:32.334970Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.264211)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n 
model.predict(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.337011Z","iopub.execute_input":"2022-01-01T11:50:32.337395Z","iopub.status.idle":"2022-01-01T11:50:32.370381Z","shell.execute_reply.started":"2022-01-01T11:50:32.337369Z","shell.execute_reply":"2022-01-01T11:50:32.369816Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"(0.7207605727361562, (1800,))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Feature Selection","metadata":{}},{"cell_type":"code","source":"### BORUTA ###\n\nmodel = BoostBoruta(clf_xgb, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.371634Z","iopub.execute_input":"2022-01-01T11:50:32.372109Z","iopub.status.idle":"2022-01-01T11:50:50.797541Z","shell.execute_reply.started":"2022-01-01T11:50:32.372066Z","shell.execute_reply":"2022-01-01T11:50:50.797059Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n                                    colsample_bylevel=None,\n                                    colsample_bynode=None,\n                                    colsample_bytree=None,\n                                    enable_categorical=False, gamma=None,\n                                    gpu_id=None, importance_type=None,\n                                    interaction_constraints=None,\n                                    learning_rate=None, max_delta_step=None,\n                                    max_depth=None, min_child_weight=None,\n                                    missing=nan, monotone_constraints=None,\n                                    n_estimators=150, n_jobs=-1,\n                                    num_parallel_tree=None, predictor=None,\n                                    random_state=0, reg_alpha=None,\n                                    reg_lambda=None, scale_pos_weight=None,\n                                    subsample=None, tree_method=None,\n                                    validate_parameters=None, verbosity=0),\n            max_iter=200)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.800394Z","iopub.execute_input":"2022-01-01T11:50:50.800795Z","iopub.status.idle":"2022-01-01T11:50:50.809566Z","shell.execute_reply.started":"2022-01-01T11:50:50.800767Z","shell.execute_reply":"2022-01-01T11:50:50.808911Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=0),\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.810633Z","iopub.execute_input":"2022-01-01T11:50:50.811078Z","iopub.status.idle":"2022-01-01T11:50:50.834426Z","shell.execute_reply.started":"2022-01-01T11:50:50.811040Z","shell.execute_reply":"2022-01-01T11:50:50.833776Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"(0.9161111111111111, (1800,), (1800, 11), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(regr_xgb, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.835608Z","iopub.execute_input":"2022-01-01T11:50:50.836142Z","iopub.status.idle":"2022-01-01T11:50:58.558180Z","shell.execute_reply.started":"2022-01-01T11:50:50.836100Z","shell.execute_reply":"2022-01-01T11:50:58.557365Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.559585Z","iopub.execute_input":"2022-01-01T11:50:58.560110Z","iopub.status.idle":"2022-01-01T11:50:58.569301Z","shell.execute_reply.started":"2022-01-01T11:50:58.560048Z","shell.execute_reply":"2022-01-01T11:50:58.568542Z"},"trusted":true},"execution_count":17,"outputs":[{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, 
scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.570558Z","iopub.execute_input":"2022-01-01T11:50:58.570828Z","iopub.status.idle":"2022-01-01T11:50:58.584624Z","shell.execute_reply.started":"2022-01-01T11:50:58.570792Z","shell.execute_reply":"2022-01-01T11:50:58.584081Z"},"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"(0.7317444492376407, (1800,), (1800, 7))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(regr_xgb, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.585749Z","iopub.execute_input":"2022-01-01T11:50:58.586163Z","iopub.status.idle":"2022-01-01T11:51:09.404587Z","shell.execute_reply.started":"2022-01-01T11:50:58.586126Z","shell.execute_reply":"2022-01-01T11:51:09.403781Z"},"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n 
verbosity=0),\n         min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.406057Z","iopub.execute_input":"2022-01-01T11:51:09.406434Z","iopub.status.idle":"2022-01-01T11:51:09.416068Z","shell.execute_reply.started":"2022-01-01T11:51:09.406399Z","shell.execute_reply":"2022-01-01T11:51:09.415411Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n              gamma=0, gpu_id=-1, importance_type=None,\n              interaction_constraints='', learning_rate=0.300000012,\n              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n              monotone_constraints='()', n_estimators=150, n_jobs=-1,\n              num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n              validate_parameters=1, verbosity=0),\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.417248Z","iopub.execute_input":"2022-01-01T11:51:09.417698Z","iopub.status.idle":"2022-01-01T11:51:09.450280Z","shell.execute_reply.started":"2022-01-01T11:51:09.417657Z","shell.execute_reply":"2022-01-01T11:51:09.449664Z"},"trusted":true},"execution_count":21,"outputs":[{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"(0.7274037362877257, (1800,), (1800, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Feature Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### BORUTA SHAP ###\n\nmodel = BoostBoruta(\n    clf_xgb, max_iter=200, perc=100,\n    importance_type='shap_importances', 
train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.451169Z","iopub.execute_input":"2022-01-01T11:51:09.451507Z","iopub.status.idle":"2022-01-01T11:51:33.925757Z","shell.execute_reply.started":"2022-01-01T11:51:09.451482Z","shell.execute_reply":"2022-01-01T11:51:33.925076Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n importance_type='shap_importances', max_iter=200,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.926762Z","iopub.execute_input":"2022-01-01T11:51:33.926940Z","iopub.status.idle":"2022-01-01T11:51:33.934907Z","shell.execute_reply.started":"2022-01-01T11:51:33.926918Z","shell.execute_reply":"2022-01-01T11:51:33.934315Z"},"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, 
max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=0),\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.935950Z","iopub.execute_input":"2022-01-01T11:51:33.936419Z","iopub.status.idle":"2022-01-01T11:51:33.961319Z","shell.execute_reply.started":"2022-01-01T11:51:33.936381Z","shell.execute_reply":"2022-01-01T11:51:33.960533Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"(0.91, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_xgb, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.962369Z","iopub.execute_input":"2022-01-01T11:51:33.962555Z","iopub.status.idle":"2022-01-01T11:51:47.059712Z","shell.execute_reply.started":"2022-01-01T11:51:33.962532Z","shell.execute_reply":"2022-01-01T11:51:47.058892Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n 
max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.060847Z","iopub.execute_input":"2022-01-01T11:51:47.061090Z","iopub.status.idle":"2022-01-01T11:51:47.069229Z","shell.execute_reply.started":"2022-01-01T11:51:47.061061Z","shell.execute_reply":"2022-01-01T11:51:47.068462Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n 
model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.070353Z","iopub.execute_input":"2022-01-01T11:51:47.071217Z","iopub.status.idle":"2022-01-01T11:51:47.087333Z","shell.execute_reply.started":"2022-01-01T11:51:47.071168Z","shell.execute_reply":"2022-01-01T11:51:47.086754Z"},"trusted":true},"execution_count":27,"outputs":[{"execution_count":27,"output_type":"execute_result","data":{"text/plain":"(0.7317444492376407, (1800,), (1800, 7))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_xgb, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.088455Z","iopub.execute_input":"2022-01-01T11:51:47.088921Z","iopub.status.idle":"2022-01-01T11:51:59.186202Z","shell.execute_reply.started":"2022-01-01T11:51:47.088885Z","shell.execute_reply":"2022-01-01T11:51:59.185431Z"},"trusted":true},"execution_count":28,"outputs":[{"execution_count":28,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1,\n 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.187276Z","iopub.execute_input":"2022-01-01T11:51:59.188081Z","iopub.status.idle":"2022-01-01T11:51:59.199276Z","shell.execute_reply.started":"2022-01-01T11:51:59.188004Z","shell.execute_reply":"2022-01-01T11:51:59.198325Z"},"trusted":true},"execution_count":29,"outputs":[{"execution_count":29,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n              gamma=0, gpu_id=-1, importance_type=None,\n              interaction_constraints='', learning_rate=0.300000012,\n              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n              monotone_constraints='()', n_estimators=150, n_jobs=-1,\n              num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n              validate_parameters=1, verbosity=0),\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.200366Z","iopub.execute_input":"2022-01-01T11:51:59.200640Z","iopub.status.idle":"2022-01-01T11:51:59.222774Z","shell.execute_reply.started":"2022-01-01T11:51:59.200592Z","shell.execute_reply":"2022-01-01T11:51:59.222078Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(0.7249664284333042, (1800,), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameter Tuning + Feature Selection","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA ###\n\nmodel = BoostBoruta(clf_xgb, param_grid=param_grid, max_iter=200, perc=100)\nmodel.fit(X_clf_train, 
y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.224176Z","iopub.execute_input":"2022-01-01T11:51:59.224707Z","iopub.status.idle":"2022-01-01T12:14:09.045290Z","shell.execute_reply.started":"2022-01-01T11:51:59.224667Z","shell.execute_reply":"2022-01-01T12:14:09.044649Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00026 ### eval_score: 0.20001\ntrial: 0002 ### iterations: 00022 ### eval_score: 0.20348\ntrial: 0003 ### iterations: 00026 ### eval_score: 0.20001\ntrial: 0004 ### iterations: 00022 ### eval_score: 0.20348\ntrial: 0005 ### iterations: 00048 ### eval_score: 0.19925\ntrial: 0006 ### iterations: 00052 ### eval_score: 0.20307\ntrial: 0007 ### iterations: 00048 ### eval_score: 0.19925\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.20307\n","output_type":"stream"},{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.046490Z","iopub.execute_input":"2022-01-01T12:14:09.047104Z","iopub.status.idle":"2022-01-01T12:14:09.056559Z","shell.execute_reply.started":"2022-01-01T12:14:09.047070Z","shell.execute_reply":"2022-01-01T12:14:09.056076Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1, max_delta_step=0,\n max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 10},\n 0.199248,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.057462Z","iopub.execute_input":"2022-01-01T12:14:09.057740Z","iopub.status.idle":"2022-01-01T12:14:09.086612Z","shell.execute_reply.started":"2022-01-01T12:14:09.057716Z","shell.execute_reply":"2022-01-01T12:14:09.085920Z"},"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"(0.9144444444444444, (1800,), (1800, 11), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(\n regr_xgb, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, 
sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.087595Z","iopub.execute_input":"2022-01-01T12:14:09.087798Z","iopub.status.idle":"2022-01-01T12:16:42.203604Z","shell.execute_reply.started":"2022-01-01T12:14:09.087772Z","shell.execute_reply":"2022-01-01T12:16:42.202743Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.25941\ntrial: 0002 ### iterations: 00077 ### eval_score: 0.25055\ntrial: 0003 ### iterations: 00086 ### eval_score: 0.25676\ntrial: 0004 ### iterations: 00098 ### eval_score: 0.25383\ntrial: 0005 ### iterations: 00050 ### eval_score: 0.25751\ntrial: 0006 ### iterations: 00028 ### eval_score: 0.26007\ntrial: 0007 ### iterations: 00084 ### eval_score: 0.2603\ntrial: 0008 ### iterations: 00024 ### eval_score: 0.26278\n","output_type":"stream"},{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.205176Z","iopub.execute_input":"2022-01-01T12:16:42.205439Z","iopub.status.idle":"2022-01-01T12:16:42.215355Z","shell.execute_reply.started":"2022-01-01T12:16:42.205404Z","shell.execute_reply":"2022-01-01T12:16:42.214732Z"},"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1350674222191923,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=38, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.250552,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.216398Z","iopub.execute_input":"2022-01-01T12:16:42.216642Z","iopub.status.idle":"2022-01-01T12:16:42.242381Z","shell.execute_reply.started":"2022-01-01T12:16:42.216606Z","shell.execute_reply":"2022-01-01T12:16:42.241879Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"(0.7488873349293266, (1800,), (1800, 10))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(\n regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, 
y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.245655Z","iopub.execute_input":"2022-01-01T12:16:42.247219Z","iopub.status.idle":"2022-01-01T12:26:08.685124Z","shell.execute_reply.started":"2022-01-01T12:16:42.247188Z","shell.execute_reply":"2022-01-01T12:26:08.684364Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.26412\ntrial: 0002 ### iterations: 00080 ### eval_score: 0.25357\ntrial: 0003 ### iterations: 00054 ### eval_score: 0.26123\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.2801\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.27046\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.24789\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.25928\ntrial: 0008 ### iterations: 00140 ### eval_score: 0.27284\n","output_type":"stream"},{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:26:08.686184Z","iopub.execute_input":"2022-01-01T12:26:08.686931Z","iopub.status.idle":"2022-01-01T12:26:08.696854Z","shell.execute_reply.started":"2022-01-01T12:26:08.686898Z","shell.execute_reply":"2022-01-01T12:26:08.696004Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.247887,\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:26:08.697934Z","iopub.execute_input":"2022-01-01T12:26:08.698155Z","iopub.status.idle":"2022-01-01T12:26:08.736781Z","shell.execute_reply.started":"2022-01-01T12:26:08.698128Z","shell.execute_reply":"2022-01-01T12:26:08.736145Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"(0.7542006308661441, (1800,), (1800, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_xgb, 
param_grid=param_grid, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"scrolled":true,"execution":{"iopub.status.busy":"2022-01-01T12:26:08.740222Z","iopub.execute_input":"2022-01-01T12:26:08.741848Z","iopub.status.idle":"2022-01-01T12:56:13.612807Z","shell.execute_reply.started":"2022-01-01T12:26:08.741813Z","shell.execute_reply":"2022-01-01T12:56:13.611991Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00024 ### eval_score: 0.20151\ntrial: 0002 ### iterations: 00020 ### eval_score: 0.20877\ntrial: 0003 ### iterations: 00024 ### eval_score: 0.20151\ntrial: 0004 ### iterations: 00020 ### eval_score: 0.20877\ntrial: 0005 ### iterations: 00048 ### eval_score: 0.20401\ntrial: 0006 ### iterations: 00048 ### eval_score: 0.20575\ntrial: 0007 ### iterations: 00048 ### eval_score: 0.20401\ntrial: 0008 ### iterations: 00048 ### eval_score: 0.20575\n","output_type":"stream"},{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n importance_type='shap_importances', max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]},\n 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.617168Z","iopub.execute_input":"2022-01-01T12:56:13.617372Z","iopub.status.idle":"2022-01-01T12:56:13.626563Z","shell.execute_reply.started":"2022-01-01T12:56:13.617349Z","shell.execute_reply":"2022-01-01T12:56:13.626036Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.2, max_delta_step=0,\n max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.2, 'num_leaves': 25, 'max_depth': 10},\n 0.201509,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.627454Z","iopub.execute_input":"2022-01-01T12:56:13.627825Z","iopub.status.idle":"2022-01-01T12:56:13.665907Z","shell.execute_reply.started":"2022-01-01T12:56:13.627797Z","shell.execute_reply":"2022-01-01T12:56:13.664686Z"},"trusted":true},"execution_count":42,"outputs":[{"execution_count":42,"output_type":"execute_result","data":{"text/plain":"(0.9144444444444444, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) SHAP 
###\n\nmodel = BoostRFE(\n regr_xgb, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.667149Z","iopub.execute_input":"2022-01-01T12:56:13.667539Z","iopub.status.idle":"2022-01-01T13:08:38.854835Z","shell.execute_reply.started":"2022-01-01T12:56:13.667509Z","shell.execute_reply":"2022-01-01T13:08:38.854142Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.25941\ntrial: 0002 ### iterations: 00064 ### eval_score: 0.25075\ntrial: 0003 ### iterations: 00075 ### eval_score: 0.25493\ntrial: 0004 ### iterations: 00084 ### eval_score: 0.25002\ntrial: 0005 ### iterations: 00093 ### eval_score: 0.25609\ntrial: 0006 ### iterations: 00039 ### eval_score: 0.2573\ntrial: 0007 ### iterations: 00074 ### eval_score: 0.25348\ntrial: 0008 ### iterations: 00032 ### eval_score: 0.2583\n","output_type":"stream"},{"execution_count":43,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0, 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.855807Z","iopub.execute_input":"2022-01-01T13:08:38.856007Z","iopub.status.idle":"2022-01-01T13:08:38.866421Z","shell.execute_reply.started":"2022-01-01T13:08:38.855982Z","shell.execute_reply":"2022-01-01T13:08:38.865771Z"},"trusted":true},"execution_count":44,"outputs":[{"execution_count":44,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1669837381562427,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1669837381562427, 'num_leaves': 25, 'max_depth': 10},\n 0.250021,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.867249Z","iopub.execute_input":"2022-01-01T13:08:38.867888Z","iopub.status.idle":"2022-01-01T13:08:38.887178Z","shell.execute_reply.started":"2022-01-01T13:08:38.867860Z","shell.execute_reply":"2022-01-01T13:08:38.886666Z"},"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"(0.7499501426259738, (1800,), (1800, 11))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n 
regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.890197Z","iopub.execute_input":"2022-01-01T13:08:38.891876Z","iopub.status.idle":"2022-01-01T13:41:32.886109Z","shell.execute_reply.started":"2022-01-01T13:08:38.891845Z","shell.execute_reply":"2022-01-01T13:41:32.885257Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.25811\ntrial: 0002 ### iterations: 00078 ### eval_score: 0.25554\ntrial: 0003 ### iterations: 00059 ### eval_score: 0.26658\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.27356\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.26426\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.25537\ntrial: 0007 ### iterations: 00052 ### eval_score: 0.26107\ntrial: 0008 ### iterations: 00137 ### eval_score: 0.27787\n","output_type":"stream"},{"execution_count":46,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0, 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.887300Z","iopub.execute_input":"2022-01-01T13:41:32.887495Z","iopub.status.idle":"2022-01-01T13:41:32.897203Z","shell.execute_reply.started":"2022-01-01T13:41:32.887472Z","shell.execute_reply":"2022-01-01T13:41:32.896455Z"},"trusted":true},"execution_count":47,"outputs":[{"execution_count":47,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.255374,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.898201Z","iopub.execute_input":"2022-01-01T13:41:32.898493Z","iopub.status.idle":"2022-01-01T13:41:32.931801Z","shell.execute_reply.started":"2022-01-01T13:41:32.898469Z","shell.execute_reply":"2022-01-01T13:41:32.931131Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"(0.7391290836488575, (1800,), (1800, 11))"},"metadata":{}}]},{"cell_type":"markdown","source":"# CUSTOM EVAL METRIC 
SUPPORT","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score\n\ndef AUC(y_hat, dtrain):\n y_true = dtrain.get_label()\n return 'auc', roc_auc_score(y_true, y_hat)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.932773Z","iopub.execute_input":"2022-01-01T13:41:32.932979Z","iopub.status.idle":"2022-01-01T13:41:32.940277Z","shell.execute_reply.started":"2022-01-01T13:41:32.932952Z","shell.execute_reply":"2022-01-01T13:41:32.939659Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"model = BoostRFE(\n clf_xgb, \n param_grid=param_grid, min_features_to_select=1, step=1,\n greater_is_better=True\n)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0,\n eval_metric=AUC\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.943194Z","iopub.execute_input":"2022-01-01T13:41:32.944797Z","iopub.status.idle":"2022-01-01T13:43:50.574377Z","shell.execute_reply.started":"2022-01-01T13:41:32.944765Z","shell.execute_reply":"2022-01-01T13:43:50.573628Z"},"trusted":true},"execution_count":50,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00017 ### eval_score: 0.9757\ntrial: 0002 ### iterations: 00026 ### eval_score: 0.97632\ntrial: 0003 ### iterations: 00017 ### eval_score: 0.9757\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.97632\ntrial: 0005 ### iterations: 00033 ### eval_score: 0.97594\ntrial: 0006 ### iterations: 00034 ### eval_score: 0.97577\ntrial: 0007 ### iterations: 00033 ### eval_score: 0.97594\ntrial: 0008 ### iterations: 00034 ### eval_score: 0.97577\n","output_type":"stream"},{"execution_count":50,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, 
gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n greater_is_better=True, min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]}]} ================================================ FILE: requirements.txt ================================================ numpy scipy scikit-learn>=0.24.1 shap>=0.39.0 hyperopt==0.2.5 ================================================ FILE: setup.py ================================================ import pathlib from setuptools import setup, find_packages HERE = pathlib.Path(__file__).parent VERSION = '0.2.7' PACKAGE_NAME = 'shap-hypetune' AUTHOR = 'Marco Cerliani' AUTHOR_EMAIL = 'cerlymarco@gmail.com' URL = 'https://github.com/cerlymarco/shap-hypetune' LICENSE = 'MIT' DESCRIPTION = 'A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.' 
LONG_DESCRIPTION = (HERE / "README.md").read_text()
LONG_DESC_TYPE = "text/markdown"

INSTALL_REQUIRES = [
    'numpy',
    'scipy',
    'scikit-learn>=0.24.1',
    'shap>=0.39.0',
    'hyperopt==0.2.5'
]

setup(name=PACKAGE_NAME,
      version=VERSION,
      description=DESCRIPTION,
      long_description=LONG_DESCRIPTION,
      long_description_content_type=LONG_DESC_TYPE,
      author=AUTHOR,
      license=LICENSE,
      author_email=AUTHOR_EMAIL,
      url=URL,
      install_requires=INSTALL_REQUIRES,
      python_requires='>=3',
      packages=find_packages()
      )

================================================
FILE: shaphypetune/__init__.py
================================================
from .utils import *
from ._classes import *
from .shaphypetune import *

================================================
FILE: shaphypetune/_classes.py
================================================
import io
import contextlib
import warnings

import numpy as np
import scipy as sp
from copy import deepcopy

from sklearn.base import clone
from sklearn.utils.validation import check_is_fitted
from sklearn.base import BaseEstimator, TransformerMixin

from joblib import Parallel, delayed
from hyperopt import fmin, tpe

from .utils import ParameterSampler, _check_param, _check_boosting
from .utils import _set_categorical_indexes, _get_categorical_support
from .utils import _feature_importances, _shap_importances


class _BoostSearch(BaseEstimator):
    """Base class for BoostSearch meta-estimator.

    Warning: This class should not be used directly. Use derived classes
    instead.
    """

    def __init__(self):
        pass

    def _validate_param_grid(self, fit_params):
        """Private method to validate fitting parameters."""

        if not isinstance(self.param_grid, dict):
            raise ValueError("Pass param_grid in dict format.")
        self._param_grid = self.param_grid.copy()

        for p_k, p_v in self._param_grid.items():
            self._param_grid[p_k] = _check_param(p_v)

        if 'eval_set' not in fit_params:
            raise ValueError(
                "When tuning parameters, at least "
                "an evaluation set is required.")

        self._eval_score = np.argmax if self.greater_is_better else np.argmin
        self._score_sign = -1 if self.greater_is_better else 1

        rs = ParameterSampler(
            n_iter=self.n_iter,
            param_distributions=self._param_grid,
            random_state=self.sampling_seed
        )
        self._param_combi, self._tuning_type = rs.sample()
        self._trial_id = 1

        if self.verbose > 0:
            n_trials = self.n_iter if self._tuning_type == 'hyperopt' \
                else len(self._param_combi)
            print("\n{} trials detected for {}\n".format(
                n_trials, tuple(self.param_grid.keys())))

    def _fit(self, X, y, fit_params, params=None):
        """Private method to fit a single boosting model and extract results."""

        model = self._build_model(params)
        if isinstance(model, _BoostSelector):
            model.fit(X=X, y=y, **fit_params)
        else:
            with contextlib.redirect_stdout(io.StringIO()):
                model.fit(X=X, y=y, **fit_params)

        results = {'params': params, 'status': 'ok'}

        if isinstance(model, _BoostSelector):
            results['booster'] = model.estimator_
            results['model'] = model
        else:
            results['booster'] = model
            results['model'] = None

        if 'eval_set' not in fit_params:
            return results

        if self.boost_type_ == 'XGB':
            # w/ eval_set and w/ early_stopping_rounds
            if hasattr(results['booster'], 'best_score'):
                results['iterations'] = results['booster'].best_iteration
            # w/ eval_set and w/o early_stopping_rounds
            else:
                valid_id = list(results['booster'].evals_result_.keys())[-1]
                eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
                results['iterations'] = \
                    len(results['booster'].evals_result_[valid_id][eval_metric])
        else:
            # w/ eval_set and w/ early_stopping_rounds
            if results['booster'].best_iteration_ is not None:
                results['iterations'] = results['booster'].best_iteration_
            # w/ eval_set and w/o early_stopping_rounds
            else:
                valid_id = list(results['booster'].evals_result_.keys())[-1]
                eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
                results['iterations'] = \
                    len(results['booster'].evals_result_[valid_id][eval_metric])

        if self.boost_type_ == 'XGB':
            # w/ eval_set and w/ early_stopping_rounds
            if hasattr(results['booster'], 'best_score'):
                results['loss'] = results['booster'].best_score
            # w/ eval_set and w/o early_stopping_rounds
            else:
                valid_id = list(results['booster'].evals_result_.keys())[-1]
                eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
                results['loss'] = \
                    results['booster'].evals_result_[valid_id][eval_metric][-1]
        else:
            valid_id = list(results['booster'].best_score_.keys())[-1]
            eval_metric = list(results['booster'].best_score_[valid_id])[-1]
            results['loss'] = results['booster'].best_score_[valid_id][eval_metric]

        if params is not None:
            if self.verbose > 0:
                msg = "trial: {} ### iterations: {} ### eval_score: {}".format(
                    str(self._trial_id).zfill(4),
                    str(results['iterations']).zfill(5),
                    round(results['loss'], 5)
                )
                print(msg)
            self._trial_id += 1
            results['loss'] *= self._score_sign

        return results

    def fit(self, X, y, trials=None, **fit_params):
        """Fit the provided boosting algorithm while searching the best subset
        of features (according to the selected strategy) and choosing the best
        parameters configuration (if provided).

        It takes the same arguments available in the estimator fit.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The training input samples.

        y : array-like of shape (n_samples,)
            Target values.

        trials : hyperopt.Trials() object, default=None
            A hyperopt trials object, used to store intermediate results for
            all optimization runs. Effective (and required) only when hyperopt
            parameter searching is computed.

        **fit_params : Additional fitting arguments.

        Returns
        -------
        self : object
        """
        self.boost_type_ = _check_boosting(self.estimator)

        if self.param_grid is None:
            results = self._fit(X, y, fit_params)

            for v in vars(results['model']):
                if v.endswith("_") and not v.startswith("__"):
                    setattr(self, str(v), getattr(results['model'], str(v)))
        else:
            self._validate_param_grid(fit_params)

            if self._tuning_type == 'hyperopt':
                if trials is None:
                    raise ValueError(
                        "trials must not be None when using hyperopt."
                    )

                search = fmin(
                    fn=lambda p: self._fit(
                        params=p, X=X, y=y, fit_params=fit_params
                    ),
                    space=self._param_combi, algo=tpe.suggest,
                    max_evals=self.n_iter, trials=trials,
                    rstate=np.random.RandomState(self.sampling_seed),
                    show_progressbar=False, verbose=0
                )
                all_results = trials.results
            else:
                all_results = Parallel(
                    n_jobs=self.n_jobs,
                    verbose=self.verbose * int(bool(self.n_jobs))
                )(delayed(self._fit)(X, y, fit_params, params)
                  for params in self._param_combi)

            # extract results from parallel loops
            self.trials_, self.iterations_, self.scores_, models = [], [], [], []
            for job_res in all_results:
                self.trials_.append(job_res['params'])
                self.iterations_.append(job_res['iterations'])
                self.scores_.append(self._score_sign * job_res['loss'])
                if isinstance(job_res['model'], _BoostSelector):
                    models.append(job_res['model'])
                else:
                    models.append(job_res['booster'])

            # get the best
            id_best = self._eval_score(self.scores_)
            self.best_params_ = self.trials_[id_best]
            self.best_iter_ = self.iterations_[id_best]
            self.best_score_ = self.scores_[id_best]
            self.estimator_ = models[id_best]

            for v in vars(models[id_best]):
                if v.endswith("_") and not v.startswith("__"):
                    setattr(self, str(v), getattr(models[id_best], str(v)))

        return self

    def predict(self, X, **predict_params):
        """Predict X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples.

        **predict_params : Additional predict arguments.

        Returns
        -------
        pred : ndarray of shape (n_samples,)
            The predicted values.
""" check_is_fitted(self) if hasattr(self, 'transform'): X = self.transform(X) return self.estimator_.predict(X, **predict_params) def predict_proba(self, X, **predict_params): """Predict X probabilities. Parameters ---------- X : array-like of shape (n_samples, n_features) Samples. **predict_params : Additional predict arguments. Returns ------- pred : ndarray of shape (n_samples, n_classes) The predicted values. """ check_is_fitted(self) # raise original AttributeError getattr(self.estimator_, 'predict_proba') if hasattr(self, 'transform'): X = self.transform(X) return self.estimator_.predict_proba(X, **predict_params) def score(self, X, y, sample_weight=None): """Return the score on the given test data and labels. Parameters ---------- X : array-like of shape (n_samples, n_features) Test samples. y : array-like of shape (n_samples,) True values for X. sample_weight : array-like of shape (n_samples,), default=None Sample weights. Returns ------- score : float Accuracy for classification, R2 for regression. """ check_is_fitted(self) if hasattr(self, 'transform'): X = self.transform(X) return self.estimator_.score(X, y, sample_weight=sample_weight) class _BoostSelector(BaseEstimator, TransformerMixin): """Base class for feature selection meta-estimator. Warning: This class should not be used directly. Use derived classes instead. """ def __init__(self): pass def transform(self, X): """Reduces the input X to the features selected by Boruta. Parameters ---------- X : array-like of shape (n_samples, n_features) Samples. Returns ------- X : array-like of shape (n_samples, n_features_) The input samples with only the selected features by Boruta. 
""" check_is_fitted(self) shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") if shapes[1] != self.support_.shape[0]: raise ValueError( "Expected {} features, received {}.".format( self.support_.shape[0], shapes[1])) if isinstance(X, np.ndarray): return X[:, self.support_] elif hasattr(X, 'loc'): return X.loc[:, self.support_] else: raise ValueError("Data type not understood.") def get_support(self, indices=False): """Get a mask, or integer index, of the features selected. Parameters ---------- indices : bool, default=False If True, the return value will be an array of integers, rather than a boolean mask. Returns ------- support : array An index that selects the retained features from a feature vector. If `indices` is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If `indices` is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. """ check_is_fitted(self) mask = self.support_ return mask if not indices else np.where(mask)[0] class _Boruta(_BoostSelector): """Base class for BoostBoruta meta-estimator. Warning: This class should not be used directly. Use derived classes instead. Notes ----- The code for the Boruta algorithm is inspired and improved from: https://github.com/scikit-learn-contrib/boruta_py """ def __init__(self, estimator, *, perc=100, alpha=0.05, max_iter=100, early_stopping_boruta_rounds=None, importance_type='feature_importances', train_importance=True, verbose=0): self.estimator = estimator self.perc = perc self.alpha = alpha self.max_iter = max_iter self.early_stopping_boruta_rounds = early_stopping_boruta_rounds self.importance_type = importance_type self.train_importance = train_importance self.verbose = verbose def _create_X(self, X, feat_id_real): """Private method to add shadow features to the original ones. 
""" if isinstance(X, np.ndarray): X_real = X[:, feat_id_real].copy() X_sha = X_real.copy() X_sha = np.apply_along_axis(self._random_state.permutation, 0, X_sha) X = np.hstack((X_real, X_sha)) elif hasattr(X, 'iloc'): X_real = X.iloc[:, feat_id_real].copy() X_sha = X_real.copy() X_sha = X_sha.apply(self._random_state.permutation) X_sha = X_sha.astype(X_real.dtypes) X = X_real.join(X_sha, rsuffix='_SHA') else: raise ValueError("Data type not understood.") return X def _check_fit_params(self, fit_params, feat_id_real=None): """Private method to validate and check fit_params.""" _fit_params = deepcopy(fit_params) estimator = clone(self.estimator) # add here possible estimator checks in each iteration _fit_params = _set_categorical_indexes( self.support_, self._cat_support, _fit_params, duplicate=True) if feat_id_real is None: # final model fit if 'eval_set' in _fit_params: _fit_params['eval_set'] = list(map(lambda x: ( self.transform(x[0]), x[1] ), _fit_params['eval_set'])) else: if 'eval_set' in _fit_params: # iterative model fit _fit_params['eval_set'] = list(map(lambda x: ( self._create_X(x[0], feat_id_real), x[1] ), _fit_params['eval_set'])) if 'feature_name' in _fit_params: # LGB _fit_params['feature_name'] = 'auto' if 'feature_weights' in _fit_params: # XGB import warnings warnings.warn( "feature_weights is not supported when selecting features. 
" "It's automatically set to None.") _fit_params['feature_weights'] = None return _fit_params, estimator def _do_tests(self, dec_reg, hit_reg, iter_id): """Private method to operate Bonferroni corrections on the feature selections.""" active_features = np.where(dec_reg >= 0)[0] hits = hit_reg[active_features] # get uncorrected p values based on hit_reg to_accept_ps = sp.stats.binom.sf(hits - 1, iter_id, .5).flatten() to_reject_ps = sp.stats.binom.cdf(hits, iter_id, .5).flatten() # Bonferroni correction with the total n_features in each iteration to_accept = to_accept_ps <= self.alpha / float(len(dec_reg)) to_reject = to_reject_ps <= self.alpha / float(len(dec_reg)) # find features which are 0 and have been rejected or accepted to_accept = np.where((dec_reg[active_features] == 0) * to_accept)[0] to_reject = np.where((dec_reg[active_features] == 0) * to_reject)[0] # updating dec_reg dec_reg[active_features[to_accept]] = 1 dec_reg[active_features[to_reject]] = -1 return dec_reg def fit(self, X, y, **fit_params): """Fit the Boruta algorithm to automatically tune the number of selected features.""" self.boost_type_ = _check_boosting(self.estimator) if self.max_iter < 1: raise ValueError('max_iter should be an integer >0.') if self.perc <= 0 or self.perc > 100: raise ValueError('The percentile should be between 0 and 100.') if self.alpha <= 0 or self.alpha > 1: raise ValueError('alpha should be between 0 and 1.') if self.early_stopping_boruta_rounds is None: es_boruta_rounds = self.max_iter else: if self.early_stopping_boruta_rounds < 1: raise ValueError( 'early_stopping_boruta_rounds should be an integer >0.') es_boruta_rounds = self.early_stopping_boruta_rounds importances = ['feature_importances', 'shap_importances'] if self.importance_type not in importances: raise ValueError( "importance_type must be one of {}. 
Got '{}'".format( importances, self.importance_type)) if self.importance_type == 'shap_importances': if not self.train_importance and 'eval_set' not in fit_params: raise ValueError( "When train_importance is set to False, using " "shap_importances, pass at least an eval_set.") eval_importance = not self.train_importance and 'eval_set' in fit_params shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") n_features = shapes[1] # create mask for user-defined categorical features self._cat_support = _get_categorical_support(n_features, fit_params) # holds the decision about each feature: # default (0); accepted (1); rejected (-1) dec_reg = np.zeros(n_features, dtype=int) dec_history = np.zeros((self.max_iter, n_features), dtype=int) # counts how many times a given feature was more important than # the best of the shadow features hit_reg = np.zeros(n_features, dtype=int) # record the history of the iterations imp_history = np.zeros(n_features, dtype=float) sha_max_history = [] for i in range(self.max_iter): if (dec_reg != 0).all(): if self.verbose > 1: print("All features analyzed. 
Boruta stop!") break if self.verbose > 1: print('Iteration: {} / {}'.format(i + 1, self.max_iter)) self._random_state = np.random.RandomState(i + 1000) # add shadow attributes, shuffle and train estimator self.support_ = dec_reg >= 0 feat_id_real = np.where(self.support_)[0] n_real = feat_id_real.shape[0] _fit_params, estimator = self._check_fit_params(fit_params, feat_id_real) estimator.set_params(random_state=i + 1000) _X = self._create_X(X, feat_id_real) with contextlib.redirect_stdout(io.StringIO()): estimator.fit(_X, y, **_fit_params) # get coefs if self.importance_type == 'feature_importances': coefs = _feature_importances(estimator) else: if eval_importance: coefs = _shap_importances( estimator, _fit_params['eval_set'][-1][0]) else: coefs = _shap_importances(estimator, _X) # separate importances of real and shadow features imp_sha = coefs[n_real:] imp_real = np.zeros(n_features) * np.nan imp_real[feat_id_real] = coefs[:n_real] # get the threshold of shadow importances used for rejection imp_sha_max = np.percentile(imp_sha, self.perc) # record importance history sha_max_history.append(imp_sha_max) imp_history = np.vstack((imp_history, imp_real)) # register which feature is more imp than the max of shadows hit_reg[np.where(imp_real[~np.isnan(imp_real)] > imp_sha_max)[0]] += 1 # check if a feature is doing better than expected by chance dec_reg = self._do_tests(dec_reg, hit_reg, i + 1) dec_history[i] = dec_reg es_id = i - es_boruta_rounds if es_id >= 0: if np.equal(dec_history[es_id:(i + 1)], dec_reg).all(): if self.verbose > 0: print("Boruta early stopping at iteration {}".format(i + 1)) break confirmed = np.where(dec_reg == 1)[0] tentative = np.where(dec_reg == 0)[0] self.support_ = np.zeros(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) * 4 self.n_features_ = confirmed.shape[0] self.importance_history_ = imp_history[1:] if tentative.shape[0] > 0: tentative_median = np.nanmedian(imp_history[1:, tentative], axis=0) tentative_low = 
tentative[ np.where(tentative_median <= np.median(sha_max_history))[0]] tentative_up = np.setdiff1d(tentative, tentative_low) self.ranking_[tentative_low] = 3 if tentative_up.shape[0] > 0: self.ranking_[tentative_up] = 2 if confirmed.shape[0] > 0: self.support_[confirmed] = True self.ranking_[confirmed] = 1 if (~self.support_).all(): raise RuntimeError( "Boruta didn't select any feature. Try to increase max_iter or " "increase (if not None) early_stopping_boruta_rounds or " "decrease perc.") _fit_params, self.estimator_ = self._check_fit_params(fit_params) with contextlib.redirect_stdout(io.StringIO()): self.estimator_.fit(self.transform(X), y, **_fit_params) return self class _RFE(_BoostSelector): """Base class for BoostRFE meta-estimator. Warning: This class should not be used directly. Use derived classes instead. """ def __init__(self, estimator, *, min_features_to_select=None, step=1, greater_is_better=False, importance_type='feature_importances', train_importance=True, verbose=0): self.estimator = estimator self.min_features_to_select = min_features_to_select self.step = step self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.verbose = verbose def _check_fit_params(self, fit_params): """Private method to validate and check fit_params.""" _fit_params = deepcopy(fit_params) estimator = clone(self.estimator) # add here possible estimator checks in each iteration _fit_params = _set_categorical_indexes( self.support_, self._cat_support, _fit_params) if 'eval_set' in _fit_params: _fit_params['eval_set'] = list(map(lambda x: ( self.transform(x[0]), x[1] ), _fit_params['eval_set'])) if 'feature_name' in _fit_params: # LGB _fit_params['feature_name'] = 'auto' if 'feature_weights' in _fit_params: # XGB import warnings warnings.warn( "feature_weights is not supported when selecting features. 
" "It's automatically set to None.") _fit_params['feature_weights'] = None return _fit_params, estimator def _step_score(self, estimator): """Return the score for a fit on eval_set.""" if self.boost_type_ == 'LGB': valid_id = list(estimator.best_score_.keys())[-1] eval_metric = list(estimator.best_score_[valid_id])[-1] score = estimator.best_score_[valid_id][eval_metric] else: # w/ eval_set and w/ early_stopping_rounds if hasattr(estimator, 'best_score'): score = estimator.best_score # w/ eval_set and w/o early_stopping_rounds else: valid_id = list(estimator.evals_result_.keys())[-1] eval_metric = list(estimator.evals_result_[valid_id])[-1] score = estimator.evals_result_[valid_id][eval_metric][-1] return score def fit(self, X, y, **fit_params): """Fit the RFE algorithm to automatically tune the number of selected features.""" self.boost_type_ = _check_boosting(self.estimator) importances = ['feature_importances', 'shap_importances'] if self.importance_type not in importances: raise ValueError( "importance_type must be one of {}. 
Got '{}'".format( importances, self.importance_type)) # scoring controls the calculation of self.score_history_ # scoring is used automatically when 'eval_set' is in fit_params scoring = 'eval_set' in fit_params if self.importance_type == 'shap_importances': if not self.train_importance and not scoring: raise ValueError( "When train_importance is set to False, using " "shap_importances, pass at least an eval_set.") eval_importance = not self.train_importance and scoring shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") n_features = shapes[1] # create mask for user-defined categorical features self._cat_support = _get_categorical_support(n_features, fit_params) if self.min_features_to_select is None: if scoring: min_features_to_select = 1 else: min_features_to_select = n_features // 2 else: min_features_to_select = self.min_features_to_select if 0.0 < self.step < 1.0: step = int(max(1, self.step * n_features)) else: step = int(self.step) if step <= 0: raise ValueError("Step must be >0.") self.support_ = np.ones(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) if scoring: self.score_history_ = [] eval_score = np.max if self.greater_is_better else np.min best_score = -np.inf if self.greater_is_better else np.inf while np.sum(self.support_) > min_features_to_select: # remaining features features = np.arange(n_features)[self.support_] _fit_params, estimator = self._check_fit_params(fit_params) if self.verbose > 1: print("Fitting estimator with {} features".format( self.support_.sum())) with contextlib.redirect_stdout(io.StringIO()): estimator.fit(self.transform(X), y, **_fit_params) # get coefs if self.importance_type == 'feature_importances': coefs = _feature_importances(estimator) else: if eval_importance: coefs = _shap_importances( estimator, _fit_params['eval_set'][-1][0]) else: coefs = _shap_importances( estimator, self.transform(X)) ranks = np.argsort(coefs) # eliminate the worst features threshold = min(step, 
np.sum(self.support_) - min_features_to_select) # compute step score on the previous selection iteration # because 'estimator' must use features # that have not been eliminated yet if scoring: score = self._step_score(estimator) self.score_history_.append(score) if best_score != eval_score([score, best_score]): best_score = score best_support = self.support_.copy() best_ranking = self.ranking_.copy() best_estimator = estimator self.support_[features[ranks][:threshold]] = False self.ranking_[np.logical_not(self.support_)] += 1 # set final attributes _fit_params, self.estimator_ = self._check_fit_params(fit_params) if self.verbose > 1: print("Fitting estimator with {} features".format(self.support_.sum())) with contextlib.redirect_stdout(io.StringIO()): self.estimator_.fit(self.transform(X), y, **_fit_params) # compute step score when only min_features_to_select features left if scoring: score = self._step_score(self.estimator_) self.score_history_.append(score) if best_score == eval_score([score, best_score]): self.support_ = best_support self.ranking_ = best_ranking self.estimator_ = best_estimator self.n_features_ = self.support_.sum() return self class _RFA(_BoostSelector): """Base class for BoostRFA meta-estimator. Warning: This class should not be used directly. Use derived classes instead. 
""" def __init__(self, estimator, *, min_features_to_select=None, step=1, greater_is_better=False, importance_type='feature_importances', train_importance=True, verbose=0): self.estimator = estimator self.min_features_to_select = min_features_to_select self.step = step self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.verbose = verbose def _check_fit_params(self, fit_params, inverse=False): """Private method to validate and check fit_params.""" _fit_params = deepcopy(fit_params) estimator = clone(self.estimator) # add here possible estimator checks in each iteration _fit_params = _set_categorical_indexes( self.support_, self._cat_support, _fit_params) if 'eval_set' in _fit_params: _fit_params['eval_set'] = list(map(lambda x: ( self._transform(x[0], inverse), x[1] ), _fit_params['eval_set'])) if 'feature_name' in _fit_params: # LGB _fit_params['feature_name'] = 'auto' if 'feature_weights' in _fit_params: # XGB import warnings warnings.warn( "feature_weights is not supported when selecting features. 
" "It's automatically set to None.") _fit_params['feature_weights'] = None return _fit_params, estimator def _step_score(self, estimator): """Return the score for a fit on eval_set.""" if self.boost_type_ == 'LGB': valid_id = list(estimator.best_score_.keys())[-1] eval_metric = list(estimator.best_score_[valid_id])[-1] score = estimator.best_score_[valid_id][eval_metric] else: # w/ eval_set and w/ early_stopping_rounds if hasattr(estimator, 'best_score'): score = estimator.best_score # w/ eval_set and w/o early_stopping_rounds else: valid_id = list(estimator.evals_result_.keys())[-1] eval_metric = list(estimator.evals_result_[valid_id])[-1] score = estimator.evals_result_[valid_id][eval_metric][-1] return score def fit(self, X, y, **fit_params): """Fit the RFA algorithm to automatically tune the number of selected features.""" self.boost_type_ = _check_boosting(self.estimator) importances = ['feature_importances', 'shap_importances'] if self.importance_type not in importances: raise ValueError( "importance_type must be one of {}. 
Got '{}'".format( importances, self.importance_type)) # scoring controls the calculation of self.score_history_ # scoring is used automatically when 'eval_set' is in fit_params scoring = 'eval_set' in fit_params if self.importance_type == 'shap_importances': if not self.train_importance and not scoring: raise ValueError( "When train_importance is set to False, using " "shap_importances, pass at least an eval_set.") eval_importance = not self.train_importance and scoring shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") n_features = shapes[1] # create mask for user-defined categorical features self._cat_support = _get_categorical_support(n_features, fit_params) if self.min_features_to_select is None: if scoring: min_features_to_select = 1 else: min_features_to_select = n_features // 2 else: if scoring: min_features_to_select = self.min_features_to_select else: min_features_to_select = n_features - self.min_features_to_select if 0.0 < self.step < 1.0: step = int(max(1, self.step * n_features)) else: step = int(self.step) if step <= 0: raise ValueError("Step must be >0.") self.support_ = np.zeros(n_features, dtype=bool) self._support = np.ones(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) self._ranking = np.ones(n_features, dtype=int) if scoring: self.score_history_ = [] eval_score = np.max if self.greater_is_better else np.min best_score = -np.inf if self.greater_is_better else np.inf while np.sum(self._support) > min_features_to_select: # remaining features features = np.arange(n_features)[self._support] # score the previously added features if scoring and np.sum(self.support_) > 0: _fit_params, estimator = self._check_fit_params(fit_params) with contextlib.redirect_stdout(io.StringIO()): estimator.fit(self._transform(X, inverse=False), y, **_fit_params) score = self._step_score(estimator) self.score_history_.append(score) if best_score != eval_score([score, best_score]): best_score = score best_support = 
self.support_.copy() best_ranking = self.ranking_.copy() best_estimator = estimator # evaluate the remaining features _fit_params, _estimator = self._check_fit_params(fit_params, inverse=True) if self.verbose > 1: print("Fitting estimator with {} features".format(self._support.sum())) with contextlib.redirect_stdout(io.StringIO()): _estimator.fit(self._transform(X, inverse=True), y, **_fit_params) if self._support.sum() == n_features: all_features_estimator = _estimator # get coefs if self.importance_type == 'feature_importances': coefs = _feature_importances(_estimator) else: if eval_importance: coefs = _shap_importances( _estimator, _fit_params['eval_set'][-1][0]) else: coefs = _shap_importances( _estimator, self._transform(X, inverse=True)) ranks = np.argsort(-coefs) # the rank is inverted # add the best features threshold = min(step, np.sum(self._support) - min_features_to_select) # remaining features to test self._support[features[ranks][:threshold]] = False self._ranking[np.logical_not(self._support)] += 1 # features tested self.support_[features[ranks][:threshold]] = True self.ranking_[np.logical_not(self.support_)] += 1 # set final attributes _fit_params, self.estimator_ = self._check_fit_params(fit_params) if self.verbose > 1: print("Fitting estimator with {} features".format(self._support.sum())) with contextlib.redirect_stdout(io.StringIO()): self.estimator_.fit(self._transform(X, inverse=False), y, **_fit_params) # compute step score when only min_features_to_select features left if scoring: score = self._step_score(self.estimator_) self.score_history_.append(score) if best_score == eval_score([score, best_score]): self.support_ = best_support self.ranking_ = best_ranking self.estimator_ = best_estimator if len(set(self.score_history_)) == 1: self.support_ = np.ones(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) self.estimator_ = all_features_estimator self.n_features_ = self.support_.sum() return self def _transform(self, X, 
inverse=False): """Private method to reduce the input X to the features selected.""" shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") if shapes[1] != self.support_.shape[0]: raise ValueError( "Expected {} features, received {}.".format( self.support_.shape[0], shapes[1])) if inverse: if isinstance(X, np.ndarray): return X[:, self._support] elif hasattr(X, 'loc'): return X.loc[:, self._support] elif sp.sparse.issparse(X): return X[:, self._support] else: raise ValueError("Data type not understood.") else: if isinstance(X, np.ndarray): return X[:, self.support_] elif hasattr(X, 'loc'): return X.loc[:, self.support_] elif sp.sparse.issparse(X): return X[:, self.support_] else: raise ValueError("Data type not understood.") def transform(self, X): """Reduces the input X to the features selected with RFA. Parameters ---------- X : array-like of shape (n_samples, n_features) Samples. Returns ------- X : array-like of shape (n_samples, n_features_) The input samples with only the features selected with RFA. """ check_is_fitted(self) return self._transform(X, inverse=False) ================================================ FILE: shaphypetune/shaphypetune.py ================================================ from sklearn.base import clone from ._classes import _BoostSearch, _Boruta, _RFA, _RFE class BoostSearch(_BoostSearch): """Hyperparameter searching and optimization on a given validation set for LGBModel or XGBModel. Pass an LGBModel or XGBModel, and a dictionary with the parameter boundaries for grid, random or bayesian search. To operate random search pass distributions in the param_grid with an rvs method for sampling (such as those from scipy.stats.distributions). To operate bayesian search pass hyperopt distributions. The specification of n_iter or sampling_seed is effective only with random or hyperopt searches. The best parameter combination is the one which obtains the best score (as returned by eval_metric) on the provided eval_set. 
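For instance, the search mode follows directly from the shape of param_grid. A minimal sketch (the parameter names are illustrative, and `UniformLike` is a hypothetical dependency-free stand-in for a scipy.stats frozen distribution):

```python
import random

# Grid search: every entry is a plain list, so the full grid is enumerated.
grid_space = {'n_estimators': [100, 200], 'learning_rate': [0.05, 0.1]}

# Random search: at least one entry exposes an rvs() sampling method,
# as scipy.stats frozen distributions do. UniformLike is a hypothetical
# stand-in used only to keep this sketch dependency-free.
class UniformLike:
    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self, random_state=None):
        # draw one sample uniformly from [low, high]
        return random.uniform(self.low, self.high)

random_space = {'n_estimators': [100, 200],
                'learning_rate': UniformLike(0.01, 0.3)}
```

With grid_space all four combinations would be tried; with a distribution-valued entry like the one in random_space, the number of sampled configurations is capped by n_iter.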
If all parameters are presented as lists, floats or integers, grid-search is performed. If at least one parameter is given as a distribution (such as those from scipy.stats.distributions), random-search is performed, sampling with replacement. Bayesian search is effective only when all the parameters to tune are in the form of hyperopt distributions. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- estimator : object A supervised learning estimator of LGBModel or XGBModel type. param_grid : dict Dictionary with parameter names (`str`) as keys and distributions or lists of parameters to try. greater_is_better : bool, default=False Whether the quantity to monitor is a score function, meaning high is good, or a loss function, meaning low is good. n_iter : int, default=None Effective only for random or hyperopt search. Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution. sampling_seed : int, default=None Effective only for random or hyperopt search. The seed used to sample from the hyperparameter distributions. n_jobs : int, default=None Effective only with grid and random search. The number of jobs to run in parallel for model fitting. ``None`` means 1 (one processor). ``-1`` means using all processors. verbose : int, default=1 Verbosity mode. <=0 silent all; >0 print trial logs with the associated score. Attributes ---------- estimator_ : estimator Estimator that was chosen by the search, i.e. estimator which gave the best score on the eval_set. best_params_ : dict Parameter setting that gave the best results on the eval_set. trials_ : list A list of dicts. The dicts are all the parameter combinations tried and derived from the param_grid. best_score_ : float The best score achieved among all the parameter combinations tried. scores_ : list The scores achieved on the eval_set by all the models tried. 
best_iter_ : int The boosting iterations achieved by the best parameter combination. iterations_ : list The boosting iterations of all the models tried. boost_type_ : str The type of the boosting estimator (LGB or XGB). """ def __init__(self, estimator, *, param_grid, greater_is_better=False, n_iter=None, sampling_seed=None, verbose=1, n_jobs=None): self.estimator = estimator self.param_grid = param_grid self.greater_is_better = greater_is_better self.n_iter = n_iter self.sampling_seed = sampling_seed self.verbose = verbose self.n_jobs = n_jobs def _build_model(self, params): """Private method to build model.""" model = clone(self.estimator) model.set_params(**params) return model class BoostBoruta(_BoostSearch, _Boruta): """Simultaneous feature selection with the Boruta algorithm and hyperparameter searching on a given validation set for LGBModel or XGBModel. Pass an LGBModel or XGBModel to compute feature selection with the Boruta algorithm. The best features are used to train a new gradient boosting instance. When an eval_set is provided, shadow features are built on it as well. If param_grid is a dictionary with parameter boundaries, hyperparameter tuning is performed simultaneously. The parameter combinations are scored on the provided eval_set. To operate random search pass distributions in the param_grid with an rvs method for sampling (such as those from scipy.stats.distributions). To operate bayesian search pass hyperopt distributions. The specification of n_iter or sampling_seed is effective only with random or hyperopt searches. The best parameter combination is the one which obtains the best score (as returned by eval_metric) on the provided eval_set. If all parameters are presented as lists, floats or integers, grid-search is performed. If at least one parameter is given as a distribution (such as those from scipy.stats.distributions), random-search is performed, sampling with replacement. 
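The shadow features mentioned above can be built with plain NumPy, as in this simplified sketch of the internal permutation step (the helper name is illustrative):

```python
import numpy as np

def add_shadow_features(X, seed=0):
    # Each shadow column is an independent permutation of a real column:
    # it keeps the column's marginal distribution but breaks any link to y,
    # giving a per-column "noise baseline" for feature importance.
    rng = np.random.RandomState(seed)
    X_sha = np.apply_along_axis(rng.permutation, 0, X.copy())
    return np.hstack((X, X_sha))

X = np.arange(12.0).reshape(4, 3)
X_ext = add_shadow_features(X)  # shape (4, 6): 3 real + 3 shadow columns
```

A real feature is then credited a "hit" only when its importance exceeds the chosen percentile of the shadow importances.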
Bayesian search is effective only when all the parameters to tune are in the form of hyperopt distributions. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- estimator : object A supervised learning estimator of LGBModel or XGBModel type. perc : int, default=100 Threshold for comparison between shadow and real features. The lower perc is, the more false positives will be picked as relevant, but also the fewer relevant features will be left out. 100 corresponds to the max. alpha : float, default=0.05 Level at which the corrected p-values will get rejected in the correction steps. max_iter : int, default=100 The maximum number of Boruta iterations to perform. early_stopping_boruta_rounds : int, default=None The maximum number of iterations without confirming a tentative feature. Use early stopping to terminate the selection process before reaching `max_iter` iterations if the algorithm cannot confirm a tentative feature after N iterations. None means no early stopping search. importance_type : str, default='feature_importances' Which importance measure to use. It can be 'feature_importances' (the default feature importance of the gradient boosting estimator) or 'shap_importances'. train_importance : bool, default=True Effective only when importance_type='shap_importances'. Where to compute the shap feature importance: on train (True) or on eval_set (False). param_grid : dict, default=None Dictionary with parameter names (`str`) as keys and distributions or lists of parameters to try. None means no hyperparameter search. greater_is_better : bool, default=False Effective only when hyperparameters searching. Whether the quantity to monitor is a score function, meaning high is good, or a loss function, meaning low is good. n_iter : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt searches. Number of parameter settings that are sampled. 
n_iter trades off runtime vs quality of the solution. sampling_seed : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt search. The seed used to sample from the hyperparameter distributions. n_jobs : int, default=None Effective only when hyperparameters searching without hyperopt. The number of jobs to run in parallel for model fitting. ``None`` means 1 (one processor). ``-1`` means using all processors. verbose : int, default=1 Verbosity mode. <=0 silent all; ==1 print trial logs (when hyperparameters searching); >1 print feature selection logs plus trial logs (when hyperparameters searching). Attributes ---------- estimator_ : estimator The fitted estimator with the selected features and the optimal parameter combination (when hyperparameters searching). n_features_ : int The number of selected features (from the best param config when hyperparameters searching). ranking_ : ndarray of shape (n_features,) The feature ranking, such that ``ranking_[i]`` corresponds to the ranking position of the i-th feature (from the best param config when hyperparameters searching). Selected features are assigned rank 1 (2: tentative upper bound, 3: tentative lower bound, 4: rejected). support_ : ndarray of shape (n_features,) The mask of selected features (from the best param config when hyperparameters searching). importance_history_ : ndarray of shape (n_iters, n_features) The importance values for each feature across all iterations. best_params_ : dict Available only when hyperparameters searching. Parameter setting that gave the best results on the eval_set. trials_ : list Available only when hyperparameters searching. A list of dicts. The dicts are all the parameter combinations tried and derived from the param_grid. best_score_ : float Available only when hyperparameters searching. The best score achieved among all the parameter combinations tried. scores_ : list Available only when hyperparameters searching. 
The scores achieved on the eval_set by all the models tried. best_iter_ : int Available only when hyperparameters searching. The boosting iterations achieved by the best parameter combination. iterations_ : list Available only when hyperparameters searching. The boosting iterations of all the models tried. boost_type_ : str The type of the boosting estimator (LGB or XGB). Notes ----- The code for the Boruta algorithm is inspired and improved from: https://github.com/scikit-learn-contrib/boruta_py """ def __init__(self, estimator, *, perc=100, alpha=0.05, max_iter=100, early_stopping_boruta_rounds=None, param_grid=None, greater_is_better=False, importance_type='feature_importances', train_importance=True, n_iter=None, sampling_seed=None, verbose=1, n_jobs=None): self.estimator = estimator self.perc = perc self.alpha = alpha self.max_iter = max_iter self.early_stopping_boruta_rounds = early_stopping_boruta_rounds self.param_grid = param_grid self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.n_iter = n_iter self.sampling_seed = sampling_seed self.verbose = verbose self.n_jobs = n_jobs def _build_model(self, params=None): """Private method to build model.""" estimator = clone(self.estimator) if params is not None: estimator.set_params(**params) model = _Boruta( estimator=estimator, perc=self.perc, alpha=self.alpha, max_iter=self.max_iter, early_stopping_boruta_rounds=self.early_stopping_boruta_rounds, importance_type=self.importance_type, train_importance=self.train_importance, verbose=self.verbose ) return model class BoostRFE(_BoostSearch, _RFE): """Simultaneous feature selection with RFE and hyperparameter searching on a given 
validation set for LGBModel or XGBModel. Pass an LGBModel or XGBModel to compute feature selection with RFE. The gradient boosting instance with the best features is selected. When an eval_set is provided, the best gradient boosting model and the best features are obtained by evaluating the score with eval_metric. Otherwise, the best combination is obtained looking only at feature importance. If param_grid is a dictionary with parameter boundaries, hyperparameter tuning is performed simultaneously. The parameter combinations are scored on the provided eval_set. To operate random search pass distributions in the param_grid with an rvs method for sampling (such as those from scipy.stats.distributions). To operate bayesian search pass hyperopt distributions. The specification of n_iter or sampling_seed is effective only with random or hyperopt searches. The best parameter combination is the one which obtains the best score (as returned by eval_metric) on the provided eval_set. If all parameters are presented as lists, floats or integers, grid-search is performed. If at least one parameter is given as a distribution (such as those from scipy.stats.distributions), random-search is performed, sampling with replacement. Bayesian search is effective only when all the parameters to tune are in the form of hyperopt distributions. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- estimator : object A supervised learning estimator of LGBModel or XGBModel type. step : int or float, default=1 If greater than or equal to 1, then `step` corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then `step` corresponds to the percentage (rounded down) of features to remove at each iteration. Note that the last iteration may remove fewer than `step` features in order to reach `min_features_to_select`. min_features_to_select : int, default=None The minimum number of features to be selected. 
This number of features will always be scored, even if the difference between the original feature count and `min_features_to_select` isn't divisible by `step`. The default value for min_features_to_select is set to 1 when a eval_set is provided, otherwise it always corresponds to n_features // 2. importance_type : str, default='feature_importances' Which importance measure to use. It can be 'feature_importances' (the default feature importance of the gradient boosting estimator) or 'shap_importances'. train_importance : bool, default=True Effective only when importance_type='shap_importances'. Where to compute the shap feature importance: on train (True) or on eval_set (False). param_grid : dict, default=None Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try. None means no hyperparameters search. greater_is_better : bool, default=False Effective only when hyperparameters searching. Whether the quantity to monitor is a score function, meaning high is good, or a loss function, meaning low is good. n_iter : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt serach. Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution. sampling_seed : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt serach. The seed used to sample from the hyperparameter distributions. n_jobs : int, default=None Effective only when hyperparameters searching without hyperopt. The number of jobs to run in parallel for model fitting. ``None`` means 1 using one processor. ``-1`` means using all processors. verbose : int, default=1 Verbosity mode. <=0 silent all; ==1 print trial logs (when hyperparameters searching); >1 print feature selection logs plus trial logs (when hyperparameters searching). 
Attributes ---------- estimator_ : estimator The fitted estimator with the select features and the optimal parameter combination (when hyperparameters searching). n_features_ : int The number of selected features (from the best param config when hyperparameters searching). ranking_ : ndarray of shape (n_features,) The feature ranking, such that ``ranking_[i]`` corresponds to the ranking position of the i-th feature (from the best param config when hyperparameters searching). Selected features are assigned rank 1. support_ : ndarray of shape (n_features,) The mask of selected features (from the best param config when hyperparameters searching). score_history_ : list Available only when a eval_set is provided. Scores obtained reducing the features (from the best param config when hyperparameters searching). best_params_ : dict Available only when hyperparameters searching. Parameter setting that gave the best results on the eval_set. trials_ : list Available only when hyperparameters searching. A list of dicts. The dicts are all the parameter combinations tried and derived from the param_grid. best_score_ : float Available only when hyperparameters searching. The best score achieved by all the possible combination created. scores_ : list Available only when hyperparameters searching. The scores achieved on the eval_set by all the models tried. best_iter_ : int Available only when hyperparameters searching. The boosting iterations achieved by the best parameters combination. iterations_ : list Available only when hyperparameters searching. The boosting iterations of all the models tried. boost_type_ : str The type of the boosting estimator (LGB or XGB). 
""" def __init__(self, estimator, *, min_features_to_select=None, step=1, param_grid=None, greater_is_better=False, importance_type='feature_importances', train_importance=True, n_iter=None, sampling_seed=None, verbose=1, n_jobs=None): self.estimator = estimator self.min_features_to_select = min_features_to_select self.step = step self.param_grid = param_grid self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.n_iter = n_iter self.sampling_seed = sampling_seed self.verbose = verbose self.n_jobs = n_jobs def _build_model(self, params=None): """Private method to build model.""" estimator = clone(self.estimator) if params is None: model = _RFE( estimator=estimator, min_features_to_select=self.min_features_to_select, step=self.step, greater_is_better=self.greater_is_better, importance_type=self.importance_type, train_importance=self.train_importance, verbose=self.verbose ) else: estimator.set_params(**params) model = _RFE( estimator=estimator, min_features_to_select=self.min_features_to_select, step=self.step, greater_is_better=self.greater_is_better, importance_type=self.importance_type, train_importance=self.train_importance, verbose=self.verbose ) return model class BoostRFA(_BoostSearch, _RFA): """Simultaneous features selection with RFA and hyperparamater searching on a given validation set for LGBModel or XGBModel. Pass a LGBModel or XGBModel to compute features selection with RFA. The gradient boosting instance with the best features is selected. When a eval_set is provided, the best gradient boosting and the best features are obtained evaluating the score with eval_metric. Otherwise, the best combination is obtained looking only at feature importance. If param_grid is a dictionary with parameter boundaries, a hyperparameter tuning is computed simultaneously. The parameter combinations are scored on the provided eval_set. 
    To operate random search pass distributions in the param_grid with rvs
    method for sampling (such as those from scipy.stats.distributions).
    To operate bayesian search pass hyperopt distributions.
    The specification of n_iter or sampling_seed is effective only with
    random or hyperopt searches.
    The best parameter combination is the one that obtains the best score
    (as returned by eval_metric) on the provided eval_set.

    If all parameters are presented as lists/floats/integers, grid-search
    is performed. If at least one parameter is given as a distribution
    (such as those from scipy.stats.distributions), random-search is
    performed, computing sampling with replacement. Bayesian search is
    effective only when all the parameters to tune are given as hyperopt
    distributions.
    It is highly recommended to use continuous distributions for continuous
    parameters.

    Parameters
    ----------
    estimator : object
        A supervised learning estimator of LGBModel or XGBModel type.

    step : int or float, default=1
        If greater than or equal to 1, then `step` corresponds to the
        (integer) number of features to remove at each iteration.
        If within (0.0, 1.0), then `step` corresponds to the percentage
        (rounded down) of features to remove at each iteration.
        Note that the last iteration may remove fewer than `step` features
        in order to reach `min_features_to_select`.

    min_features_to_select : int, default=None
        The minimum number of features to be selected.
        This number of features will always be scored, even if the
        difference between the original feature count and
        `min_features_to_select` isn't divisible by `step`.
        The default value for min_features_to_select is set to 1 when an
        eval_set is provided, otherwise it always corresponds to
        n_features // 2.

    importance_type : str, default='feature_importances'
        Which importance measure to use. It can be 'feature_importances'
        (the default feature importance of the gradient boosting estimator)
        or 'shap_importances'.

    train_importance : bool, default=True
        Effective only when importance_type='shap_importances'.
        Where to compute the shap feature importance: on train (True)
        or on eval_set (False).

    param_grid : dict, default=None
        Dictionary with parameters names (`str`) as keys and distributions
        or lists of parameters to try.
        None means no hyperparameters search.

    greater_is_better : bool, default=False
        Effective only when hyperparameters searching.
        Whether the quantity to monitor is a score function, meaning high
        is good, or a loss function, meaning low is good.

    n_iter : int, default=None
        Effective only when hyperparameters searching.
        Effective only for random or hyperopt search.
        Number of parameter settings that are sampled.
        n_iter trades off runtime vs quality of the solution.

    sampling_seed : int, default=None
        Effective only when hyperparameters searching.
        Effective only for random or hyperopt search.
        The seed used to sample from the hyperparameter distributions.

    n_jobs : int, default=None
        Effective only when hyperparameters searching without hyperopt.
        The number of jobs to run in parallel for model fitting.
        ``None`` means 1 using one processor. ``-1`` means using all
        processors.

    verbose : int, default=1
        Verbosity mode. <=0 silent all; ==1 print trial logs (when
        hyperparameters searching); >1 print feature selection logs plus
        trial logs (when hyperparameters searching).

    Attributes
    ----------
    estimator_ : estimator
        The fitted estimator with the selected features and the optimal
        parameter combination (when hyperparameters searching).

    n_features_ : int
        The number of selected features (from the best param config
        when hyperparameters searching).

    ranking_ : ndarray of shape (n_features,)
        The feature ranking, such that ``ranking_[i]`` corresponds to the
        ranking position of the i-th feature (from the best param config
        when hyperparameters searching).
        Selected features are assigned rank 1.

    support_ : ndarray of shape (n_features,)
        The mask of selected features (from the best param config when
        hyperparameters searching).

    score_history_ : list
        Available only when an eval_set is provided.
        Scores obtained reducing the features (from the best param config
        when hyperparameters searching).

    best_params_ : dict
        Available only when hyperparameters searching.
        Parameter setting that gave the best results on the eval_set.

    trials_ : list
        Available only when hyperparameters searching.
        A list of dicts. The dicts are all the parameter combinations tried
        and derived from the param_grid.

    best_score_ : float
        Available only when hyperparameters searching.
        The best score achieved by all the possible combinations created.

    scores_ : list
        Available only when hyperparameters searching.
        The scores achieved on the eval_set by all the models tried.

    best_iter_ : int
        Available only when hyperparameters searching.
        The boosting iterations achieved by the best parameters combination.

    iterations_ : list
        Available only when hyperparameters searching.
        The boosting iterations of all the models tried.

    boost_type_ : str
        The type of the boosting estimator (LGB or XGB).
    Notes
    -----
    The code for the RFA algorithm is inspired and improved from:
    https://github.com/heberleh/recursive-feature-addition
    """

    def __init__(self,
                 estimator, *,
                 min_features_to_select=None,
                 step=1,
                 param_grid=None,
                 greater_is_better=False,
                 importance_type='feature_importances',
                 train_importance=True,
                 n_iter=None,
                 sampling_seed=None,
                 verbose=1,
                 n_jobs=None):
        self.estimator = estimator
        self.min_features_to_select = min_features_to_select
        self.step = step
        self.param_grid = param_grid
        self.greater_is_better = greater_is_better
        self.importance_type = importance_type
        self.train_importance = train_importance
        self.n_iter = n_iter
        self.sampling_seed = sampling_seed
        self.verbose = verbose
        self.n_jobs = n_jobs

    def _build_model(self, params=None):
        """Private method to build model."""
        # clone the estimator and apply the trial parameters (if any)
        # before wrapping it in the RFA selector
        estimator = clone(self.estimator)

        if params is not None:
            estimator.set_params(**params)

        model = _RFA(
            estimator=estimator,
            min_features_to_select=self.min_features_to_select,
            step=self.step,
            greater_is_better=self.greater_is_better,
            importance_type=self.importance_type,
            train_importance=self.train_importance,
            verbose=self.verbose
        )

        return model


================================================
FILE: shaphypetune/utils.py
================================================
import random

import numpy as np
from itertools import product

from shap import TreeExplainer


def _check_boosting(model):
    """Check if the estimator is a LGBModel or XGBModel.

    Returns
    -------
    Model type in string format.
""" estimator_type = str(type(model)).lower() boost_type = ('LGB' if 'lightgbm' in estimator_type else '') + \ ('XGB' if 'xgboost' in estimator_type else '') if len(boost_type) != 3: raise ValueError("Pass a LGBModel or XGBModel.") return boost_type def _shap_importances(model, X): """Extract feature importances from fitted boosting models using TreeExplainer from shap. Returns ------- array of feature importances. """ explainer = TreeExplainer( model, feature_perturbation="tree_path_dependent") coefs = explainer.shap_values(X) if isinstance(coefs, list): coefs = list(map(lambda x: np.abs(x).mean(0), coefs)) coefs = np.sum(coefs, axis=0) else: coefs = np.abs(coefs).mean(0) return coefs def _feature_importances(model): """Extract feature importances from fitted boosting models. Returns ------- array of feature importances. """ if hasattr(model, 'coef_'): ## booster='gblinear' (xgb) coefs = np.square(model.coef_).sum(axis=0) else: coefs = model.feature_importances_ return coefs def _get_categorical_support(n_features, fit_params): """Obtain boolean mask for categorical features""" cat_support = np.zeros(n_features, dtype=bool) cat_ids = [] msg = "When manually setting categarical features, " \ "pass a 1D array-like of categorical columns indices " \ "(specified as integers)." 
if 'categorical_feature' in fit_params: # LGB cat_ids = fit_params['categorical_feature'] if len(np.shape(cat_ids)) != 1: raise ValueError(msg) if not all([isinstance(c, int) for c in cat_ids]): raise ValueError(msg) cat_support[cat_ids] = True return cat_support def _set_categorical_indexes(support, cat_support, _fit_params, duplicate=False): """Map categorical features in each data repartition""" if cat_support.any(): n_features = support.sum() support_id = np.zeros_like(support, dtype='int32') support_id[support] = np.arange(n_features, dtype='int32') cat_feat = support_id[np.where(support & cat_support)[0]] # empty if support and cat_support are not alligned if duplicate: # is Boruta cat_feat = cat_feat.tolist() + (n_features + cat_feat).tolist() else: cat_feat = cat_feat.tolist() _fit_params['categorical_feature'] = cat_feat return _fit_params def _check_param(values): """Check the parameter boundaries passed in dict values. Returns ------- list of checked parameters. """ if isinstance(values, (list, tuple, np.ndarray)): return list(set(values)) elif 'scipy' in str(type(values)).lower(): return values elif 'hyperopt' in str(type(values)).lower(): return values else: return [values] class ParameterSampler(object): """Generator on parameters sampled from given distributions. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a scipy distribution, sampling with replacement is used. If all parameters are given as hyperopt distributions Tree of Parzen Estimators searching from hyperopt is computed. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- param_distributions : dict Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try. Distributions must provide a ``rvs`` method for random sampling (such as those from scipy.stats.distributions) or be hyperopt distributions for bayesian searching. 
If a list is given, it is sampled uniformly. n_iter : integer, default=None Number of parameter configurations that are produced. random_state : int, default=None Pass an int for reproducible output across multiple function calls. Returns ------- param_combi : list of dicts or dict of hyperopt distributions Parameter combinations. searching_type : str The searching algorithm used. """ def __init__(self, param_distributions, n_iter=None, random_state=None): self.n_iter = n_iter self.random_state = random_state self.param_distributions = param_distributions def sample(self): """Generator parameter combinations from given distributions.""" param_distributions = self.param_distributions.copy() is_grid = all(isinstance(p, list) for p in param_distributions.values()) is_random = all(isinstance(p, list) or 'scipy' in str(type(p)).lower() for p in param_distributions.values()) is_hyperopt = all('hyperopt' in str(type(p)).lower() or (len(p) < 2 if isinstance(p, list) else False) for p in param_distributions.values()) if is_grid: param_combi = list(product(*param_distributions.values())) param_combi = [ dict(zip(param_distributions.keys(), combi)) for combi in param_combi ] return param_combi, 'grid' elif is_random: if self.n_iter is None: raise ValueError( "n_iter must be an integer >0 when scipy parameter " "distributions are provided. Get None." 
) seed = (random.randint(1, 100) if self.random_state is None else self.random_state + 1) random.seed(seed) param_combi = [] k = self.n_iter for i in range(self.n_iter): dist = param_distributions.copy() combi = [] for j, v in enumerate(dist.values()): if 'scipy' in str(type(v)).lower(): combi.append(v.rvs(random_state=seed * (k + j))) else: combi.append(v[random.randint(0, len(v) - 1)]) k += i + j param_combi.append( dict(zip(param_distributions.keys(), combi)) ) np.random.mtrand._rand return param_combi, 'random' elif is_hyperopt: if self.n_iter is None: raise ValueError( "n_iter must be an integer >0 when hyperopt " "search spaces are provided. Get None." ) param_distributions = { k: p[0] if isinstance(p, list) else p for k, p in param_distributions.items() } return param_distributions, 'hyperopt' else: raise ValueError( "Parameters not recognized. " "Pass lists, scipy distributions (also in conjunction " "with lists), or hyperopt search spaces." )
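A quick illustration of the grid branch of `ParameterSampler.sample`: when every entry of the param dict is a plain list, the combinations are simply the Cartesian product of the lists. The sketch below reproduces that logic standalone with `itertools.product` (the parameter names are hypothetical, and it does not import shaphypetune itself):

```python
from itertools import product

# hypothetical search space: every value is a list -> grid search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
}

# same construction as the is_grid branch above:
# one dict per element of the Cartesian product
param_combi = [
    dict(zip(param_grid.keys(), combi))
    for combi in product(*param_grid.values())
]

print(len(param_combi))   # 2 * 3 = 6 combinations
print(param_combi[0])     # {'n_estimators': 100, 'max_depth': 3}
```

Passing any non-list value (e.g. a scipy distribution) would instead route `sample` to the random branch, which is why `n_iter` is mandatory only there.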