Repository: cerlymarco/shap-hypetune
Branch: main
Commit: 8f46e161d27a
Files: 11
Total size: 203.6 KB

Directory structure:
gitextract_liu6w4f0/
├── .gitignore
├── LICENSE
├── README.md
├── notebooks/
│   ├── LGBM_usage.ipynb
│   └── XGBoost_usage.ipynb
├── requirements.txt
├── setup.py
└── shaphypetune/
    ├── __init__.py
    ├── _classes.py
    ├── shaphypetune.py
    └── utils.py

================================================ FILE CONTENTS ================================================

================================================ FILE: .gitignore ================================================
.DS_Store

# Created by https://www.gitignore.io/api/python

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# End of https://www.gitignore.io/api/python

================================================ FILE: LICENSE ================================================
MIT License

Copyright (c) 2021 Marco Cerliani

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
================================================ FILE: README.md ================================================
# shap-hypetune

A Python package for simultaneous Hyperparameter Tuning and Feature Selection for Gradient Boosting Models.

![shap-hypetune diagram](https://raw.githubusercontent.com/cerlymarco/shap-hypetune/master/imgs/shap-hypetune-diagram.png#center)

## Overview

Hyperparameter tuning and feature selection are two common steps in every machine learning pipeline. Most of the time they are computed separately and independently, which may result in suboptimal performance and a more time-consuming process. shap-hypetune aims to combine hyperparameter tuning and feature selection in a single pipeline, optimizing the number of selected features while searching for the optimal parameter configuration. Hyperparameter tuning or feature selection can also be carried out as standalone operations.

**shap-hypetune main features:**

- designed for gradient boosting models, such as LGBMModel or XGBModel;
- developed to integrate with the scikit-learn ecosystem;
- effective in both classification and regression tasks;
- customizable training process, supporting early stopping and all the other fitting options available in the standard algorithms' API;
- ranking feature selection algorithms: Recursive Feature Elimination (RFE), Recursive Feature Addition (RFA), or Boruta;
- classical boosting-based feature importances or SHAP feature importances (the latter can also be computed on the eval_set);
- apply grid-search, random-search, or bayesian-search (from hyperopt);
- parallelized computations with joblib.

## Installation

```shell
pip install --upgrade shap-hypetune
```

lightgbm and xgboost are not required dependencies: the module depends only on NumPy, shap, scikit-learn and hyperopt. Python 3.6 or above is supported.
## Media

- [SHAP for Feature Selection and HyperParameter Tuning](https://towardsdatascience.com/shap-for-feature-selection-and-hyperparameter-tuning-a330ec0ea104)
- [Boruta and SHAP for better Feature Selection](https://towardsdatascience.com/boruta-and-shap-for-better-feature-selection-20ea97595f4a)
- [Recursive Feature Selection: Addition or Elimination?](https://towardsdatascience.com/recursive-feature-selection-addition-or-elimination-755e5d86a791)
- [Boruta SHAP for Temporal Feature Selection](https://towardsdatascience.com/boruta-shap-for-temporal-feature-selection-96a7840c7713)

## Usage

```python
from shaphypetune import BoostSearch, BoostRFE, BoostRFA, BoostBoruta
```

#### Hyperparameters Tuning

```python
BoostSearch(
    estimator,                # LGBMModel or XGBModel
    param_grid=None,          # parameters to be optimized
    greater_is_better=False,  # minimize or maximize the monitored score
    n_iter=None,              # number of sampled parameter configurations
    sampling_seed=None,       # the seed used for parameter sampling
    verbose=1,                # verbosity mode
    n_jobs=None               # number of jobs to run in parallel
)
```

#### Feature Selection (RFE)

```python
BoostRFE(
    estimator,                              # LGBMModel or XGBModel
    min_features_to_select=None,            # the minimum number of features to be selected
    step=1,                                 # number of features to remove at each iteration
    param_grid=None,                        # parameters to be optimized
    greater_is_better=False,                # minimize or maximize the monitored score
    importance_type='feature_importances',  # which importance measure to use: default or shap
    train_importance=True,                  # whether to compute the shap feature importance on the train set or on the eval_set
    n_iter=None,                            # number of sampled parameter configurations
    sampling_seed=None,                     # the seed used for parameter sampling
    verbose=1,                              # verbosity mode
    n_jobs=None                             # number of jobs to run in parallel
)
```

#### Feature Selection (BORUTA)

```python
BoostBoruta(
    estimator,                              # LGBMModel or XGBModel
    perc=100,                               # threshold used to compare shadow and real features
    alpha=0.05,                             # p-value level for feature rejection
    max_iter=100,                           # maximum Boruta iterations to perform
    early_stopping_boruta_rounds=None,      # maximum iterations without confirming a feature
    param_grid=None,                        # parameters to be optimized
    greater_is_better=False,                # minimize or maximize the monitored score
    importance_type='feature_importances',  # which importance measure to use: default or shap
    train_importance=True,                  # whether to compute the shap feature importance on the train set or on the eval_set
    n_iter=None,                            # number of sampled parameter configurations
    sampling_seed=None,                     # the seed used for parameter sampling
    verbose=1,                              # verbosity mode
    n_jobs=None                             # number of jobs to run in parallel
)
```

#### Feature Selection (RFA)

```python
BoostRFA(
    estimator,                              # LGBMModel or XGBModel
    min_features_to_select=None,            # the minimum number of features to be selected
    step=1,                                 # number of features to add at each iteration
    param_grid=None,                        # parameters to be optimized
    greater_is_better=False,                # minimize or maximize the monitored score
    importance_type='feature_importances',  # which importance measure to use: default or shap
    train_importance=True,                  # whether to compute the shap feature importance on the train set or on the eval_set
    n_iter=None,                            # number of sampled parameter configurations
    sampling_seed=None,                     # the seed used for parameter sampling
    verbose=1,                              # verbosity mode
    n_jobs=None                             # number of jobs to run in parallel
)
```

Full examples in the [notebooks folder](https://github.com/cerlymarco/shap-hypetune/tree/main/notebooks).
================================================ FILE: notebooks/LGBM_usage.ipynb ================================================ {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification, make_regression\n\nfrom hyperopt import hp\nfrom hyperopt import Trials\n\nfrom lightgbm import *\n\ntry:\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\nexcept:\n !pip install --upgrade shap-hypetune\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\n\nimport warnings\nwarnings.simplefilter('ignore')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:43.363945Z","iopub.execute_input":"2022-01-01T11:46:43.364356Z","iopub.status.idle":"2022-01-01T11:46:45.084134Z","shell.execute_reply.started":"2022-01-01T11:46:43.364243Z","shell.execute_reply":"2022-01-01T11:46:45.083177Z"},"trusted":true},"execution_count":1,"outputs":[{"output_type":"display_data","data":{"text/plain":"","text/html":"\n"},"metadata":{}}]},{"cell_type":"code","source":"X_clf, y_clf = make_classification(n_samples=6000, n_features=20, n_classes=2, \n n_informative=4, n_redundant=6, random_state=0)\n\nX_clf_train, X_clf_valid, y_clf_train, y_clf_valid = train_test_split(\n X_clf, y_clf, test_size=0.3, shuffle=False)\n\nX_regr, y_regr = make_classification(n_samples=6000, n_features=20,\n n_informative=7, random_state=0)\n\nX_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(\n X_regr, y_regr, test_size=0.3, 
shuffle=False)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.086875Z","iopub.execute_input":"2022-01-01T11:46:45.087123Z","iopub.status.idle":"2022-01-01T11:46:45.118700Z","shell.execute_reply.started":"2022-01-01T11:46:45.087094Z","shell.execute_reply":"2022-01-01T11:46:45.117983Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"param_grid = {\n 'learning_rate': [0.2, 0.1],\n 'num_leaves': [25, 35],\n 'max_depth': [10, 12]\n}\n\nparam_dist = {\n 'learning_rate': stats.uniform(0.09, 0.25),\n 'num_leaves': stats.randint(20,40),\n 'max_depth': [10, 12]\n}\n\nparam_dist_hyperopt = {\n 'max_depth': 15 + hp.randint('num_leaves', 5), \n 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n 'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n}\n\n\nregr_lgbm = LGBMRegressor(n_estimators=150, random_state=0, n_jobs=-1)\nclf_lgbm = LGBMClassifier(n_estimators=150, random_state=0, n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.120073Z","iopub.execute_input":"2022-01-01T11:46:45.120376Z","iopub.status.idle":"2022-01-01T11:46:45.132838Z","shell.execute_reply.started":"2022-01-01T11:46:45.120336Z","shell.execute_reply":"2022-01-01T11:46:45.131615Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Hyperparameters Tuning","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH ###\n\nmodel = BoostSearch(clf_lgbm, param_grid=param_grid)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:45.134450Z","iopub.execute_input":"2022-01-01T11:46:45.135435Z","iopub.status.idle":"2022-01-01T11:46:46.383589Z","shell.execute_reply.started":"2022-01-01T11:46:45.135389Z","shell.execute_reply":"2022-01-01T11:46:46.382860Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.2085\ntrial: 0002 ### iterations: 00019 ### eval_score: 0.21112\ntrial: 0003 ### iterations: 00026 ### eval_score: 0.21162\ntrial: 0004 ### iterations: 00032 ### eval_score: 0.20747\ntrial: 0005 ### iterations: 00054 ### eval_score: 0.20244\ntrial: 0006 ### iterations: 00071 ### eval_score: 0.20052\ntrial: 0007 ### iterations: 00047 ### eval_score: 0.20306\ntrial: 0008 ### iterations: 00050 ### eval_score: 0.20506\n","output_type":"stream"},{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.388010Z","iopub.execute_input":"2022-01-01T11:46:46.389926Z","iopub.status.idle":"2022-01-01T11:46:46.397550Z","shell.execute_reply.started":"2022-01-01T11:46:46.389888Z","shell.execute_reply":"2022-01-01T11:46:46.396658Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=25, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 12},\n 0.20051586840398297)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n 
model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.398765Z","iopub.execute_input":"2022-01-01T11:46:46.399534Z","iopub.status.idle":"2022-01-01T11:46:46.436761Z","shell.execute_reply.started":"2022-01-01T11:46:46.399498Z","shell.execute_reply":"2022-01-01T11:46:46.431623Z"},"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"(0.9183333333333333, (1800,), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH ###\n\nmodel = BoostSearch(\n regr_lgbm, param_grid=param_dist,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:46.438241Z","iopub.execute_input":"2022-01-01T11:46:46.438923Z","iopub.status.idle":"2022-01-01T11:46:47.128794Z","shell.execute_reply.started":"2022-01-01T11:46:46.438892Z","shell.execute_reply":"2022-01-01T11:46:47.128107Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.07643\ntrial: 0002 ### iterations: 00052 ### eval_score: 0.06818\ntrial: 0003 ### iterations: 00062 ### eval_score: 0.07042\ntrial: 0004 ### iterations: 00033 ### eval_score: 0.07035\ntrial: 0005 ### iterations: 00032 ### eval_score: 0.07153\ntrial: 0006 ### iterations: 00012 ### eval_score: 0.07547\ntrial: 0007 ### iterations: 00041 ### eval_score: 0.07355\ntrial: 0008 ### iterations: 00025 ### eval_score: 0.07805\n","output_type":"stream"},{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMRegressor(n_estimators=150, random_state=0), n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n 
sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.132071Z","iopub.execute_input":"2022-01-01T11:46:47.132611Z","iopub.status.idle":"2022-01-01T11:46:47.142185Z","shell.execute_reply.started":"2022-01-01T11:46:47.132575Z","shell.execute_reply":"2022-01-01T11:46:47.141271Z"},"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.1350674222191923, max_depth=10, n_estimators=150,\n num_leaves=38, random_state=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.06817737242646997)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.143613Z","iopub.execute_input":"2022-01-01T11:46:47.143856Z","iopub.status.idle":"2022-01-01T11:46:47.611056Z","shell.execute_reply.started":"2022-01-01T11:46:47.143827Z","shell.execute_reply":"2022-01-01T11:46:47.610379Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"(0.7272820930747703, (1800,), (1800, 21))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT ###\n\nmodel = BoostSearch(\n regr_lgbm, param_grid=param_dist_hyperopt,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:47.614530Z","iopub.execute_input":"2022-01-01T11:46:47.616779Z","iopub.status.idle":"2022-01-01T11:46:49.268236Z","shell.execute_reply.started":"2022-01-01T11:46:47.616738Z","shell.execute_reply":"2022-01-01T11:46:49.267608Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.06979\ntrial: 0002 ### iterations: 00055 ### eval_score: 0.07039\ntrial: 0003 ### iterations: 00056 ### eval_score: 0.0716\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.07352\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07936\ntrial: 0006 ### iterations: 00147 ### eval_score: 0.06833\ntrial: 0007 ### iterations: 00032 ### eval_score: 0.07261\ntrial: 0008 ### iterations: 00096 ### eval_score: 0.07074\n","output_type":"stream"},{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=LGBMRegressor(n_estimators=150, random_state=0), n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:49.271739Z","iopub.execute_input":"2022-01-01T11:46:49.272301Z","iopub.status.idle":"2022-01-01T11:46:49.279337Z","shell.execute_reply.started":"2022-01-01T11:46:49.272264Z","shell.execute_reply":"2022-01-01T11:46:49.278727Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.7597292534356749,\n learning_rate=0.059836658149176665, max_depth=16,\n n_estimators=150, random_state=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 
0.06832542425080958)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:49.280499Z","iopub.execute_input":"2022-01-01T11:46:49.280735Z","iopub.status.idle":"2022-01-01T11:46:50.260345Z","shell.execute_reply.started":"2022-01-01T11:46:49.280700Z","shell.execute_reply":"2022-01-01T11:46:50.259694Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"(0.7266898674988451, (1800,), (1800, 21))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection","metadata":{}},{"cell_type":"code","source":"### BORUTA ###\n\nmodel = BoostBoruta(clf_lgbm, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:50.263726Z","iopub.execute_input":"2022-01-01T11:46:50.265917Z","iopub.status.idle":"2022-01-01T11:46:56.714012Z","shell.execute_reply.started":"2022-01-01T11:46:50.265869Z","shell.execute_reply":"2022-01-01T11:46:56.713278Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n max_iter=200)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.720017Z","iopub.execute_input":"2022-01-01T11:46:56.720486Z","iopub.status.idle":"2022-01-01T11:46:56.727782Z","shell.execute_reply.started":"2022-01-01T11:46:56.720450Z","shell.execute_reply":"2022-01-01T11:46:56.726815Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(n_estimators=150, random_state=0), 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.730004Z","iopub.execute_input":"2022-01-01T11:46:56.730326Z","iopub.status.idle":"2022-01-01T11:46:56.765852Z","shell.execute_reply.started":"2022-01-01T11:46:56.730286Z","shell.execute_reply":"2022-01-01T11:46:56.760625Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"(0.91, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(regr_lgbm, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:56.767160Z","iopub.execute_input":"2022-01-01T11:46:56.767432Z","iopub.status.idle":"2022-01-01T11:46:59.411924Z","shell.execute_reply.started":"2022-01-01T11:46:56.767401Z","shell.execute_reply":"2022-01-01T11:46:59.411240Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:59.415300Z","iopub.execute_input":"2022-01-01T11:46:59.417330Z","iopub.status.idle":"2022-01-01T11:46:59.424201Z","shell.execute_reply.started":"2022-01-01T11:46:59.417288Z","shell.execute_reply":"2022-01-01T11:46:59.423561Z"},"trusted":true},"execution_count":17,"outputs":[{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:46:59.425449Z","iopub.execute_input":"2022-01-01T11:46:59.425674Z","iopub.status.idle":"2022-01-01T11:47:00.248420Z","shell.execute_reply.started":"2022-01-01T11:46:59.425645Z","shell.execute_reply":"2022-01-01T11:47:00.247703Z"},"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"(0.7766363424352807, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(regr_lgbm, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:00.251993Z","iopub.execute_input":"2022-01-01T11:47:00.252510Z","iopub.status.idle":"2022-01-01T11:47:03.954790Z","shell.execute_reply.started":"2022-01-01T11:47:00.252473Z","shell.execute_reply":"2022-01-01T11:47:03.954052Z"},"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n 
min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:03.958397Z","iopub.execute_input":"2022-01-01T11:47:03.958982Z","iopub.status.idle":"2022-01-01T11:47:03.967715Z","shell.execute_reply.started":"2022-01-01T11:47:03.958931Z","shell.execute_reply":"2022-01-01T11:47:03.966909Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:03.969215Z","iopub.execute_input":"2022-01-01T11:47:03.969612Z","iopub.status.idle":"2022-01-01T11:47:04.838820Z","shell.execute_reply.started":"2022-01-01T11:47:03.969569Z","shell.execute_reply":"2022-01-01T11:47:04.838192Z"},"trusted":true},"execution_count":21,"outputs":[{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"(0.7723191919698336, (1800,), (1800, 8), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_lgbm, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:04.842289Z","iopub.execute_input":"2022-01-01T11:47:04.844564Z","iopub.status.idle":"2022-01-01T11:47:17.780389Z","shell.execute_reply.started":"2022-01-01T11:47:04.844522Z","shell.execute_reply":"2022-01-01T11:47:17.779726Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n importance_type='shap_importances', max_iter=200,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.781535Z","iopub.execute_input":"2022-01-01T11:47:17.784569Z","iopub.status.idle":"2022-01-01T11:47:17.791371Z","shell.execute_reply.started":"2022-01-01T11:47:17.784530Z","shell.execute_reply":"2022-01-01T11:47:17.790591Z"},"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(n_estimators=150, random_state=0), 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.794450Z","iopub.execute_input":"2022-01-01T11:47:17.794986Z","iopub.status.idle":"2022-01-01T11:47:17.813842Z","shell.execute_reply.started":"2022-01-01T11:47:17.794933Z","shell.execute_reply":"2022-01-01T11:47:17.813126Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"(0.9111111111111111, (1800,), (1800, 9), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_lgbm, min_features_to_select=1, step=1,\n 
importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:17.817477Z","iopub.execute_input":"2022-01-01T11:47:17.819641Z","iopub.status.idle":"2022-01-01T11:47:32.735329Z","shell.execute_reply.started":"2022-01-01T11:47:17.819595Z","shell.execute_reply":"2022-01-01T11:47:32.734687Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:32.736646Z","iopub.execute_input":"2022-01-01T11:47:32.737109Z","iopub.status.idle":"2022-01-01T11:47:32.743398Z","shell.execute_reply.started":"2022-01-01T11:47:32.737074Z","shell.execute_reply":"2022-01-01T11:47:32.742747Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:32.744765Z","iopub.execute_input":"2022-01-01T11:47:32.747374Z","iopub.status.idle":"2022-01-01T11:47:33.570515Z","shell.execute_reply.started":"2022-01-01T11:47:32.747336Z","shell.execute_reply":"2022-01-01T11:47:33.569899Z"},"trusted":true},"execution_count":27,"outputs":[{"execution_count":27,"output_type":"execute_result","data":{"text/plain":"(0.7766363424352807, (1800,), (1800, 7), (1800, 
8))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_lgbm, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:33.571778Z","iopub.execute_input":"2022-01-01T11:47:33.572261Z","iopub.status.idle":"2022-01-01T11:47:39.941084Z","shell.execute_reply.started":"2022-01-01T11:47:33.572226Z","shell.execute_reply":"2022-01-01T11:47:39.940356Z"},"trusted":true},"execution_count":28,"outputs":[{"execution_count":28,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:39.944497Z","iopub.execute_input":"2022-01-01T11:47:39.946592Z","iopub.status.idle":"2022-01-01T11:47:39.953717Z","shell.execute_reply.started":"2022-01-01T11:47:39.946550Z","shell.execute_reply":"2022-01-01T11:47:39.952924Z"},"trusted":true},"execution_count":29,"outputs":[{"execution_count":29,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(n_estimators=150, random_state=0), 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, 
pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:39.954955Z","iopub.execute_input":"2022-01-01T11:47:39.955713Z","iopub.status.idle":"2022-01-01T11:47:40.853749Z","shell.execute_reply.started":"2022-01-01T11:47:39.955669Z","shell.execute_reply":"2022-01-01T11:47:40.853100Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(0.7699366468805918, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA ###\n\nmodel = BoostBoruta(clf_lgbm, param_grid=param_grid, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:47:40.857000Z","iopub.execute_input":"2022-01-01T11:47:40.859123Z","iopub.status.idle":"2022-01-01T11:48:08.045782Z","shell.execute_reply.started":"2022-01-01T11:47:40.859074Z","shell.execute_reply":"2022-01-01T11:48:08.043191Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00023 ### eval_score: 0.19868\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.19844\ntrial: 0003 ### iterations: 00023 ### eval_score: 0.19695\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19949\ntrial: 0005 ### iterations: 00067 ### eval_score: 0.19583\ntrial: 0006 ### iterations: 00051 ### eval_score: 0.1949\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19675\ntrial: 0008 ### iterations: 00055 ### eval_score: 0.19906\n","output_type":"stream"},{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n max_iter=200,\n 
param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.047190Z","iopub.execute_input":"2022-01-01T11:48:08.048047Z","iopub.status.idle":"2022-01-01T11:48:08.056353Z","shell.execute_reply.started":"2022-01-01T11:48:08.048000Z","shell.execute_reply":"2022-01-01T11:48:08.055615Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=25, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 12},\n 0.19489866976777023,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.058015Z","iopub.execute_input":"2022-01-01T11:48:08.058593Z","iopub.status.idle":"2022-01-01T11:48:08.109632Z","shell.execute_reply.started":"2022-01-01T11:48:08.058410Z","shell.execute_reply":"2022-01-01T11:48:08.108670Z"},"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"(0.915, (1800,), (1800, 9), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(\n regr_lgbm, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:08.114460Z","iopub.execute_input":"2022-01-01T11:48:08.116626Z","iopub.status.idle":"2022-01-01T11:48:20.506235Z","shell.execute_reply.started":"2022-01-01T11:48:08.116579Z","shell.execute_reply":"2022-01-01T11:48:20.505511Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00107 ### eval_score: 0.06016\ntrial: 0002 ### iterations: 00095 ### eval_score: 0.05711\ntrial: 0003 ### iterations: 00121 ### eval_score: 0.05926\ntrial: 0004 ### iterations: 00103 ### eval_score: 0.05688\ntrial: 0005 ### iterations: 00119 ### eval_score: 0.05618\ntrial: 0006 ### iterations: 00049 ### eval_score: 0.06188\ntrial: 0007 ### iterations: 00150 ### eval_score: 0.05538\ntrial: 0008 ### iterations: 00083 ### eval_score: 0.06084\n","output_type":"stream"},{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:20.509788Z","iopub.execute_input":"2022-01-01T11:48:20.511633Z","iopub.status.idle":"2022-01-01T11:48:20.521139Z","shell.execute_reply.started":"2022-01-01T11:48:20.511592Z","shell.execute_reply":"2022-01-01T11:48:20.520293Z"},"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.13639381870463482, max_depth=12, n_estimators=150,\n num_leaves=25, random_state=0),\n {'learning_rate': 0.13639381870463482, 'num_leaves': 25, 'max_depth': 12},\n 0.0553821617278472,\n 
7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:20.522443Z","iopub.execute_input":"2022-01-01T11:48:20.522817Z","iopub.status.idle":"2022-01-01T11:48:21.145683Z","shell.execute_reply.started":"2022-01-01T11:48:20.522785Z","shell.execute_reply":"2022-01-01T11:48:21.145033Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"(0.7784645155736596, (1800,), (1800, 7), (1800, 8))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(\n regr_lgbm, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:21.149492Z","iopub.execute_input":"2022-01-01T11:48:21.151302Z","iopub.status.idle":"2022-01-01T11:48:56.679453Z","shell.execute_reply.started":"2022-01-01T11:48:21.151261Z","shell.execute_reply":"2022-01-01T11:48:56.678720Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00150 ### eval_score: 0.06507\ntrial: 0002 ### iterations: 00075 ### eval_score: 0.05784\ntrial: 0003 ### iterations: 00095 ### eval_score: 0.06088\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.06976\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07593\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.05995\ntrial: 0007 ### iterations: 00058 ### eval_score: 0.05916\ntrial: 0008 ### iterations: 00150 ### eval_score: 
0.06366\n","output_type":"stream"},{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:56.682847Z","iopub.execute_input":"2022-01-01T11:48:56.684405Z","iopub.status.idle":"2022-01-01T11:48:56.691812Z","shell.execute_reply.started":"2022-01-01T11:48:56.684368Z","shell.execute_reply":"2022-01-01T11:48:56.690932Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.8515260655364685,\n learning_rate=0.13520045129619862, max_depth=18, n_estimators=150,\n random_state=0),\n {'colsample_bytree': 0.8515260655364685,\n 'learning_rate': 0.13520045129619862,\n 'max_depth': 18},\n 0.0578369356489881,\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:48:56.693078Z","iopub.execute_input":"2022-01-01T11:48:56.693305Z","iopub.status.idle":"2022-01-01T11:48:57.115924Z","shell.execute_reply.started":"2022-01-01T11:48:56.693277Z","shell.execute_reply":"2022-01-01T11:48:57.115308Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"(0.7686451168212334, (1800,), (1800, 8), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH 
GRID-SEARCH + BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_lgbm, param_grid=param_grid, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"scrolled":true,"execution":{"iopub.status.busy":"2022-01-01T11:48:57.119397Z","iopub.execute_input":"2022-01-01T11:48:57.120009Z","iopub.status.idle":"2022-01-01T11:50:15.982498Z","shell.execute_reply.started":"2022-01-01T11:48:57.119958Z","shell.execute_reply":"2022-01-01T11:50:15.981774Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00036 ### eval_score: 0.19716\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.19818\ntrial: 0003 ### iterations: 00031 ### eval_score: 0.19881\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19949\ntrial: 0005 ### iterations: 00067 ### eval_score: 0.19583\ntrial: 0006 ### iterations: 00051 ### eval_score: 0.1949\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19675\ntrial: 0008 ### iterations: 00057 ### eval_score: 0.19284\n","output_type":"stream"},{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n importance_type='shap_importances', max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]},\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:15.988196Z","iopub.execute_input":"2022-01-01T11:50:15.988729Z","iopub.status.idle":"2022-01-01T11:50:15.996898Z","shell.execute_reply.started":"2022-01-01T11:50:15.988685Z","shell.execute_reply":"2022-01-01T11:50:15.996175Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"(LGBMClassifier(max_depth=12, n_estimators=150, num_leaves=35, random_state=0),\n {'learning_rate': 0.1, 'num_leaves': 35, 'max_depth': 12},\n 0.1928371931511303,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:15.998631Z","iopub.execute_input":"2022-01-01T11:50:15.999269Z","iopub.status.idle":"2022-01-01T11:50:16.029050Z","shell.execute_reply.started":"2022-01-01T11:50:15.999228Z","shell.execute_reply":"2022-01-01T11:50:16.028270Z"},"trusted":true},"execution_count":42,"outputs":[{"execution_count":42,"output_type":"execute_result","data":{"text/plain":"(0.9111111111111111, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_lgbm, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:16.030261Z","iopub.execute_input":"2022-01-01T11:50:16.030658Z","iopub.status.idle":"2022-01-01T11:51:19.095150Z","shell.execute_reply.started":"2022-01-01T11:50:16.030625Z","shell.execute_reply":"2022-01-01T11:51:19.094483Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00107 ### eval_score: 0.06016\ntrial: 0002 ### iterations: 00102 ### eval_score: 0.05525\ntrial: 0003 ### iterations: 00150 ### eval_score: 0.05869\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.05863\ntrial: 0005 ### iterations: 00119 ### eval_score: 0.05618\ntrial: 0006 ### iterations: 00049 ### eval_score: 0.06188\ntrial: 0007 ### iterations: 00150 ### eval_score: 0.05538\ntrial: 0008 ### iterations: 00083 ### eval_score: 0.06084\n","output_type":"stream"},{"execution_count":43,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.098487Z","iopub.execute_input":"2022-01-01T11:51:19.099062Z","iopub.status.idle":"2022-01-01T11:51:19.108772Z","shell.execute_reply.started":"2022-01-01T11:51:19.099027Z","shell.execute_reply":"2022-01-01T11:51:19.107939Z"},"trusted":true},"execution_count":44,"outputs":[{"execution_count":44,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(learning_rate=0.1350674222191923, max_depth=10, n_estimators=150,\n num_leaves=38, random_state=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 
'max_depth': 10},\n 0.05524518772497125,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.110141Z","iopub.execute_input":"2022-01-01T11:51:19.110358Z","iopub.status.idle":"2022-01-01T11:51:19.840667Z","shell.execute_reply.started":"2022-01-01T11:51:19.110333Z","shell.execute_reply":"2022-01-01T11:51:19.840035Z"},"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"(0.779012428496056, (1800,), (1800, 9), (1800, 10))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_lgbm, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:19.844245Z","iopub.execute_input":"2022-01-01T11:51:19.844839Z","iopub.status.idle":"2022-01-01T11:52:27.830673Z","shell.execute_reply.started":"2022-01-01T11:51:19.844800Z","shell.execute_reply":"2022-01-01T11:52:27.829915Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00150 ### eval_score: 0.06508\ntrial: 0002 ### iterations: 00091 ### eval_score: 0.05997\ntrial: 0003 ### iterations: 00094 ### eval_score: 0.06078\ntrial: 0004 ### iterations: 00150 ### eval_score: 0.06773\ntrial: 0005 ### iterations: 00150 ### eval_score: 0.07565\ntrial: 0006 ### iterations: 00150 ### eval_score: 
0.05935\ntrial: 0007 ### iterations: 00083 ### eval_score: 0.06047\ntrial: 0008 ### iterations: 00150 ### eval_score: 0.05966\n","output_type":"stream"},{"execution_count":46,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=LGBMRegressor(n_estimators=150, random_state=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0, train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:27.834402Z","iopub.execute_input":"2022-01-01T11:52:27.835864Z","iopub.status.idle":"2022-01-01T11:52:27.842813Z","shell.execute_reply.started":"2022-01-01T11:52:27.835812Z","shell.execute_reply":"2022-01-01T11:52:27.842095Z"},"trusted":true},"execution_count":47,"outputs":[{"execution_count":47,"output_type":"execute_result","data":{"text/plain":"(LGBMRegressor(colsample_bytree=0.7597292534356749,\n learning_rate=0.059836658149176665, max_depth=16,\n n_estimators=150, random_state=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.059352961644604275,\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape,\n model.predict(X_regr_valid, pred_contrib=True).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:27.844085Z","iopub.execute_input":"2022-01-01T11:52:27.844320Z","iopub.status.idle":"2022-01-01T11:52:28.690931Z","shell.execute_reply.started":"2022-01-01T11:52:27.844291Z","shell.execute_reply":"2022-01-01T11:52:28.690302Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"(0.7625808256692885, (1800,), (1800, 9), (1800, 
10))"},"metadata":{}}]},{"cell_type":"markdown","source":"# CUSTOM EVAL METRIC SUPPORT","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score\n\ndef AUC(y_true, y_hat):\n return 'auc', roc_auc_score(y_true, y_hat), True","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:28.691909Z","iopub.execute_input":"2022-01-01T11:52:28.692560Z","iopub.status.idle":"2022-01-01T11:52:28.696813Z","shell.execute_reply.started":"2022-01-01T11:52:28.692526Z","shell.execute_reply":"2022-01-01T11:52:28.696058Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"model = BoostRFE(\n LGBMClassifier(n_estimators=150, random_state=0, metric=\"custom\"), \n param_grid=param_grid, min_features_to_select=1, step=1,\n greater_is_better=True\n)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0, \n eval_metric=AUC\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:28.700234Z","iopub.execute_input":"2022-01-01T11:52:28.700461Z","iopub.status.idle":"2022-01-01T11:52:49.577997Z","shell.execute_reply.started":"2022-01-01T11:52:28.700433Z","shell.execute_reply":"2022-01-01T11:52:49.577317Z"},"trusted":true},"execution_count":50,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00028 ### eval_score: 0.97581\ntrial: 0002 ### iterations: 00016 ### eval_score: 0.97514\ntrial: 0003 ### iterations: 00015 ### eval_score: 0.97574\ntrial: 0004 ### iterations: 00032 ### eval_score: 0.97549\ntrial: 0005 ### iterations: 00075 ### eval_score: 0.97551\ntrial: 0006 ### iterations: 00041 ### eval_score: 0.97597\ntrial: 0007 ### iterations: 00076 ### eval_score: 0.97592\ntrial: 0008 ### iterations: 00060 ### eval_score: 
0.97539\n","output_type":"stream"},{"execution_count":50,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(metric='custom', n_estimators=150,\n random_state=0),\n greater_is_better=True, min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"markdown","source":"# CATEGORICAL FEATURE SUPPORT","metadata":{}},{"cell_type":"code","source":"categorical_feature = [0,1,2]\n\nX_clf_train[:,categorical_feature] = (X_clf_train[:,categorical_feature]+100).clip(0).astype(int)\nX_clf_valid[:,categorical_feature] = (X_clf_valid[:,categorical_feature]+100).clip(0).astype(int)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:49.581409Z","iopub.execute_input":"2022-01-01T11:52:49.581982Z","iopub.status.idle":"2022-01-01T11:52:49.589315Z","shell.execute_reply.started":"2022-01-01T11:52:49.581931Z","shell.execute_reply":"2022-01-01T11:52:49.588511Z"},"trusted":true},"execution_count":51,"outputs":[]},{"cell_type":"code","source":"### MANUAL PASS categorical_feature WITH NUMPY ARRAYS ###\n\nmodel = BoostRFE(clf_lgbm, param_grid=param_grid, min_features_to_select=1, step=1)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0,\n categorical_feature=categorical_feature\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:52:49.590366Z","iopub.execute_input":"2022-01-01T11:52:49.590604Z","iopub.status.idle":"2022-01-01T11:53:00.495917Z","shell.execute_reply.started":"2022-01-01T11:52:49.590576Z","shell.execute_reply":"2022-01-01T11:53:00.495224Z"},"trusted":true},"execution_count":52,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00029 ### eval_score: 0.2036\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.2034\ntrial: 0003 ### iterations: 00027 ### eval_score: 0.20617\ntrial: 
0004 ### iterations: 00024 ### eval_score: 0.20003\ntrial: 0005 ### iterations: 00060 ### eval_score: 0.20332\ntrial: 0006 ### iterations: 00063 ### eval_score: 0.20329\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.20136\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.19959\n","output_type":"stream"},{"execution_count":52,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"X_clf_train = pd.DataFrame(X_clf_train)\nX_clf_train[categorical_feature] = X_clf_train[categorical_feature].astype('category')\n\nX_clf_valid = pd.DataFrame(X_clf_valid)\nX_clf_valid[categorical_feature] = X_clf_valid[categorical_feature].astype('category')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:53:00.499198Z","iopub.execute_input":"2022-01-01T11:53:00.499858Z","iopub.status.idle":"2022-01-01T11:53:00.527402Z","shell.execute_reply.started":"2022-01-01T11:53:00.499814Z","shell.execute_reply":"2022-01-01T11:53:00.526779Z"},"trusted":true},"execution_count":53,"outputs":[]},{"cell_type":"code","source":"### PASS category COLUMNS IN PANDAS DF ###\n\nmodel = BoostRFE(clf_lgbm, param_grid=param_grid, min_features_to_select=1, step=1)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:53:00.529027Z","iopub.execute_input":"2022-01-01T11:53:00.529320Z","iopub.status.idle":"2022-01-01T11:53:12.422092Z","shell.execute_reply.started":"2022-01-01T11:53:00.529281Z","shell.execute_reply":"2022-01-01T11:53:12.421368Z"},"trusted":true},"execution_count":54,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00029 ### eval_score: 
0.2036\ntrial: 0002 ### iterations: 00030 ### eval_score: 0.2034\ntrial: 0003 ### iterations: 00027 ### eval_score: 0.20617\ntrial: 0004 ### iterations: 00024 ### eval_score: 0.20003\ntrial: 0005 ### iterations: 00060 ### eval_score: 0.20332\ntrial: 0006 ### iterations: 00063 ### eval_score: 0.20329\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.20136\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.19959\n","output_type":"stream"},{"execution_count":54,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=LGBMClassifier(n_estimators=150, random_state=0),\n min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]}]} ================================================ FILE: notebooks/XGBoost_usage.ipynb ================================================ {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\nfrom scipy import stats\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.datasets import make_classification, make_regression\n\nfrom hyperopt import hp\nfrom hyperopt import Trials\n\nfrom xgboost import *\n\ntry:\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\nexcept:\n !pip install --upgrade shap-hypetune\n from shaphypetune import BoostSearch, BoostBoruta, BoostRFE, BoostRFA\n\nimport 
warnings\nwarnings.simplefilter('ignore')","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:44.031173Z","iopub.execute_input":"2022-01-01T11:49:44.031497Z","iopub.status.idle":"2022-01-01T11:49:45.071830Z","shell.execute_reply.started":"2022-01-01T11:49:44.031410Z","shell.execute_reply":"2022-01-01T11:49:45.070928Z"},"trusted":true},"execution_count":1,"outputs":[]},{"cell_type":"code","source":"X_clf, y_clf = make_classification(n_samples=6000, n_features=20, n_classes=2, \n n_informative=4, n_redundant=6, random_state=0)\n\nX_clf_train, X_clf_valid, y_clf_train, y_clf_valid = train_test_split(\n X_clf, y_clf, test_size=0.3, shuffle=False)\n\nX_regr, y_regr = make_classification(n_samples=6000, n_features=20,\n n_informative=7, random_state=0)\n\nX_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(\n X_regr, y_regr, test_size=0.3, shuffle=False)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.073832Z","iopub.execute_input":"2022-01-01T11:49:45.074046Z","iopub.status.idle":"2022-01-01T11:49:45.098178Z","shell.execute_reply.started":"2022-01-01T11:49:45.074004Z","shell.execute_reply":"2022-01-01T11:49:45.097461Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"code","source":"param_grid = {\n 'learning_rate': [0.2, 0.1],\n 'num_leaves': [25, 35],\n 'max_depth': [10, 12]\n}\n\nparam_dist = {\n 'learning_rate': stats.uniform(0.09, 0.25),\n 'num_leaves': stats.randint(20,40),\n 'max_depth': [10, 12]\n}\n\nparam_dist_hyperopt = {\n 'max_depth': 15 + hp.randint('num_leaves', 5), \n 'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),\n 'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)\n}\n\n\nregr_xgb = XGBRegressor(n_estimators=150, random_state=0, verbosity=0, n_jobs=-1)\nclf_xgb = XGBClassifier(n_estimators=150, random_state=0, verbosity=0, 
n_jobs=-1)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.099715Z","iopub.execute_input":"2022-01-01T11:49:45.099916Z","iopub.status.idle":"2022-01-01T11:49:45.108765Z","shell.execute_reply.started":"2022-01-01T11:49:45.099890Z","shell.execute_reply":"2022-01-01T11:49:45.107996Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Hyperparameters Tuning","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH ###\n\nmodel = BoostSearch(clf_xgb, param_grid=param_grid)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:45.109686Z","iopub.execute_input":"2022-01-01T11:49:45.109871Z","iopub.status.idle":"2022-01-01T11:49:52.490942Z","shell.execute_reply.started":"2022-01-01T11:49:45.109848Z","shell.execute_reply":"2022-01-01T11:49:52.490078Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.2045\ntrial: 0002 ### iterations: 00026 ### eval_score: 0.19472\ntrial: 0003 ### iterations: 00021 ### eval_score: 0.2045\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.19472\ntrial: 0005 ### iterations: 00045 ### eval_score: 0.19964\ntrial: 0006 ### iterations: 00050 ### eval_score: 0.20157\ntrial: 0007 ### iterations: 00045 ### eval_score: 0.19964\ntrial: 0008 ### iterations: 00050 ### eval_score: 0.20157\n","output_type":"stream"},{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, 
min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.493607Z","iopub.execute_input":"2022-01-01T11:49:52.494126Z","iopub.status.idle":"2022-01-01T11:49:52.504649Z","shell.execute_reply.started":"2022-01-01T11:49:52.494081Z","shell.execute_reply":"2022-01-01T11:49:52.503849Z"},"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.2, max_delta_step=0,\n max_depth=12, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.2, 'num_leaves': 25, 'max_depth': 12},\n 0.194719)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n 
model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.506201Z","iopub.execute_input":"2022-01-01T11:49:52.506365Z","iopub.status.idle":"2022-01-01T11:49:52.528604Z","shell.execute_reply.started":"2022-01-01T11:49:52.506344Z","shell.execute_reply":"2022-01-01T11:49:52.528078Z"},"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"(0.9138888888888889, (1800,), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH ###\n\nmodel = BoostSearch(\n regr_xgb, param_grid=param_dist,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:49:52.529476Z","iopub.execute_input":"2022-01-01T11:49:52.530097Z","iopub.status.idle":"2022-01-01T11:50:03.018637Z","shell.execute_reply.started":"2022-01-01T11:49:52.530066Z","shell.execute_reply":"2022-01-01T11:50:03.017927Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00012 ### eval_score: 0.27616\ntrial: 0002 ### iterations: 00056 ### eval_score: 0.26211\ntrial: 0003 ### iterations: 00078 ### eval_score: 0.27603\ntrial: 0004 ### iterations: 00045 ### eval_score: 0.26117\ntrial: 0005 ### iterations: 00046 ### eval_score: 0.27868\ntrial: 0006 ### iterations: 00035 ### eval_score: 0.27815\ntrial: 0007 ### iterations: 00039 ### eval_score: 0.2753\ntrial: 0008 ### iterations: 00016 ### eval_score: 0.28116\n","output_type":"stream"},{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None, colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, 
importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estim...\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.019747Z","iopub.execute_input":"2022-01-01T11:50:03.020416Z","iopub.status.idle":"2022-01-01T11:50:03.030730Z","shell.execute_reply.started":"2022-01-01T11:50:03.020379Z","shell.execute_reply":"2022-01-01T11:50:03.030065Z"},"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1669837381562427,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1669837381562427, 'num_leaves': 25, 'max_depth': 10},\n 0.26117)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n 
model.predict(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.032146Z","iopub.execute_input":"2022-01-01T11:50:03.032612Z","iopub.status.idle":"2022-01-01T11:50:03.058721Z","shell.execute_reply.started":"2022-01-01T11:50:03.032572Z","shell.execute_reply":"2022-01-01T11:50:03.058084Z"},"trusted":true},"execution_count":9,"outputs":[{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"(0.7271524639165458, (1800,))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT ###\n\nmodel = BoostSearch(\n regr_xgb, param_grid=param_dist_hyperopt,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:03.059800Z","iopub.execute_input":"2022-01-01T11:50:03.062204Z","iopub.status.idle":"2022-01-01T11:50:32.323625Z","shell.execute_reply.started":"2022-01-01T11:50:03.062158Z","shell.execute_reply":"2022-01-01T11:50:32.322789Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.27498\ntrial: 0002 ### iterations: 00074 ### eval_score: 0.27186\ntrial: 0003 ### iterations: 00038 ### eval_score: 0.28326\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.29455\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.28037\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.26421\ntrial: 0007 ### iterations: 00052 ### eval_score: 0.27191\ntrial: 0008 ### iterations: 00133 ### eval_score: 0.29251\n","output_type":"stream"},{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"BoostSearch(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None, colsample_bytree=None,\n enable_categorical=False, 
gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estim...\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.324994Z","iopub.execute_input":"2022-01-01T11:50:32.325480Z","iopub.status.idle":"2022-01-01T11:50:32.335828Z","shell.execute_reply.started":"2022-01-01T11:50:32.325441Z","shell.execute_reply":"2022-01-01T11:50:32.334970Z"},"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.264211)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n 
model.predict(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.337011Z","iopub.execute_input":"2022-01-01T11:50:32.337395Z","iopub.status.idle":"2022-01-01T11:50:32.370381Z","shell.execute_reply.started":"2022-01-01T11:50:32.337369Z","shell.execute_reply":"2022-01-01T11:50:32.369816Z"},"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"(0.7207605727361562, (1800,))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Feature Selection","metadata":{}},{"cell_type":"code","source":"### BORUTA ###\n\nmodel = BoostBoruta(clf_xgb, max_iter=200, perc=100)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:32.371634Z","iopub.execute_input":"2022-01-01T11:50:32.372109Z","iopub.status.idle":"2022-01-01T11:50:50.797541Z","shell.execute_reply.started":"2022-01-01T11:50:32.372066Z","shell.execute_reply":"2022-01-01T11:50:50.797059Z"},"trusted":true},"execution_count":13,"outputs":[{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n                                    colsample_bylevel=None,\n                                    colsample_bynode=None,\n                                    colsample_bytree=None,\n                                    enable_categorical=False, gamma=None,\n                                    gpu_id=None, importance_type=None,\n                                    interaction_constraints=None,\n                                    learning_rate=None, max_delta_step=None,\n                                    max_depth=None, min_child_weight=None,\n                                    missing=nan, monotone_constraints=None,\n                                    n_estimators=150, n_jobs=-1,\n                                    num_parallel_tree=None, predictor=None,\n                                    random_state=0, reg_alpha=None,\n                                    reg_lambda=None, scale_pos_weight=None,\n                                    subsample=None, tree_method=None,\n                                    validate_parameters=None, verbosity=0),\n            max_iter=200)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.800394Z","iopub.execute_input":"2022-01-01T11:50:50.800795Z","iopub.status.idle":"2022-01-01T11:50:50.809566Z","shell.execute_reply.started":"2022-01-01T11:50:50.800767Z","shell.execute_reply":"2022-01-01T11:50:50.808911Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=0),\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.810633Z","iopub.execute_input":"2022-01-01T11:50:50.811078Z","iopub.status.idle":"2022-01-01T11:50:50.834426Z","shell.execute_reply.started":"2022-01-01T11:50:50.811040Z","shell.execute_reply":"2022-01-01T11:50:50.833776Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"(0.9161111111111111, (1800,), (1800, 11), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(regr_xgb, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, 
verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:50.835608Z","iopub.execute_input":"2022-01-01T11:50:50.836142Z","iopub.status.idle":"2022-01-01T11:50:58.558180Z","shell.execute_reply.started":"2022-01-01T11:50:50.836100Z","shell.execute_reply":"2022-01-01T11:50:58.557365Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.559585Z","iopub.execute_input":"2022-01-01T11:50:58.560110Z","iopub.status.idle":"2022-01-01T11:50:58.569301Z","shell.execute_reply.started":"2022-01-01T11:50:58.560048Z","shell.execute_reply":"2022-01-01T11:50:58.568542Z"},"trusted":true},"execution_count":17,"outputs":[{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, 
scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.570558Z","iopub.execute_input":"2022-01-01T11:50:58.570828Z","iopub.status.idle":"2022-01-01T11:50:58.584624Z","shell.execute_reply.started":"2022-01-01T11:50:58.570792Z","shell.execute_reply":"2022-01-01T11:50:58.584081Z"},"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"(0.7317444492376407, (1800,), (1800, 7))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(regr_xgb, min_features_to_select=1, step=1)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:50:58.585749Z","iopub.execute_input":"2022-01-01T11:50:58.586163Z","iopub.status.idle":"2022-01-01T11:51:09.404587Z","shell.execute_reply.started":"2022-01-01T11:50:58.586126Z","shell.execute_reply":"2022-01-01T11:51:09.403781Z"},"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n 
verbosity=0),\n         min_features_to_select=1)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.406057Z","iopub.execute_input":"2022-01-01T11:51:09.406434Z","iopub.status.idle":"2022-01-01T11:51:09.416068Z","shell.execute_reply.started":"2022-01-01T11:51:09.406399Z","shell.execute_reply":"2022-01-01T11:51:09.415411Z"},"trusted":true},"execution_count":20,"outputs":[{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n              gamma=0, gpu_id=-1, importance_type=None,\n              interaction_constraints='', learning_rate=0.300000012,\n              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n              monotone_constraints='()', n_estimators=150, n_jobs=-1,\n              num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n              validate_parameters=1, verbosity=0),\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.417248Z","iopub.execute_input":"2022-01-01T11:51:09.417698Z","iopub.status.idle":"2022-01-01T11:51:09.450280Z","shell.execute_reply.started":"2022-01-01T11:51:09.417657Z","shell.execute_reply":"2022-01-01T11:51:09.449664Z"},"trusted":true},"execution_count":21,"outputs":[{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"(0.7274037362877257, (1800,), (1800, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Feature Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### BORUTA SHAP ###\n\nmodel = BoostBoruta(\n    clf_xgb, max_iter=200, perc=100,\n    importance_type='shap_importances', 
train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:09.451169Z","iopub.execute_input":"2022-01-01T11:51:09.451507Z","iopub.status.idle":"2022-01-01T11:51:33.925757Z","shell.execute_reply.started":"2022-01-01T11:51:09.451482Z","shell.execute_reply":"2022-01-01T11:51:33.925076Z"},"trusted":true},"execution_count":22,"outputs":[{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n importance_type='shap_importances', max_iter=200,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.926762Z","iopub.execute_input":"2022-01-01T11:51:33.926940Z","iopub.status.idle":"2022-01-01T11:51:33.934907Z","shell.execute_reply.started":"2022-01-01T11:51:33.926918Z","shell.execute_reply":"2022-01-01T11:51:33.934315Z"},"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, 
max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n tree_method='exact', validate_parameters=1, verbosity=0),\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.935950Z","iopub.execute_input":"2022-01-01T11:51:33.936419Z","iopub.status.idle":"2022-01-01T11:51:33.961319Z","shell.execute_reply.started":"2022-01-01T11:51:33.936381Z","shell.execute_reply":"2022-01-01T11:51:33.960533Z"},"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"(0.91, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ELIMINATION (RFE) SHAP ###\n\nmodel = BoostRFE(\n regr_xgb, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:33.962369Z","iopub.execute_input":"2022-01-01T11:51:33.962555Z","iopub.status.idle":"2022-01-01T11:51:47.059712Z","shell.execute_reply.started":"2022-01-01T11:51:33.962532Z","shell.execute_reply":"2022-01-01T11:51:47.058892Z"},"trusted":true},"execution_count":25,"outputs":[{"execution_count":25,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n 
max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1,\n train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.060847Z","iopub.execute_input":"2022-01-01T11:51:47.061090Z","iopub.status.idle":"2022-01-01T11:51:47.069229Z","shell.execute_reply.started":"2022-01-01T11:51:47.061061Z","shell.execute_reply":"2022-01-01T11:51:47.068462Z"},"trusted":true},"execution_count":26,"outputs":[{"execution_count":26,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.300000012,\n max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n 7)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n 
model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.070353Z","iopub.execute_input":"2022-01-01T11:51:47.071217Z","iopub.status.idle":"2022-01-01T11:51:47.087333Z","shell.execute_reply.started":"2022-01-01T11:51:47.071168Z","shell.execute_reply":"2022-01-01T11:51:47.086754Z"},"trusted":true},"execution_count":27,"outputs":[{"execution_count":27,"output_type":"execute_result","data":{"text/plain":"(0.7317444492376407, (1800,), (1800, 7))"},"metadata":{}}]},{"cell_type":"code","source":"### RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n regr_xgb, min_features_to_select=1, step=1,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:47.088455Z","iopub.execute_input":"2022-01-01T11:51:47.088921Z","iopub.status.idle":"2022-01-01T11:51:59.186202Z","shell.execute_reply.started":"2022-01-01T11:51:47.088885Z","shell.execute_reply":"2022-01-01T11:51:59.185431Z"},"trusted":true},"execution_count":28,"outputs":[{"execution_count":28,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1,\n 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.187276Z","iopub.execute_input":"2022-01-01T11:51:59.188081Z","iopub.status.idle":"2022-01-01T11:51:59.199276Z","shell.execute_reply.started":"2022-01-01T11:51:59.188004Z","shell.execute_reply":"2022-01-01T11:51:59.198325Z"},"trusted":true},"execution_count":29,"outputs":[{"execution_count":29,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n              gamma=0, gpu_id=-1, importance_type=None,\n              interaction_constraints='', learning_rate=0.300000012,\n              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,\n              monotone_constraints='()', n_estimators=150, n_jobs=-1,\n              num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n              validate_parameters=1, verbosity=0),\n 9)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.200366Z","iopub.execute_input":"2022-01-01T11:51:59.200640Z","iopub.status.idle":"2022-01-01T11:51:59.222774Z","shell.execute_reply.started":"2022-01-01T11:51:59.200592Z","shell.execute_reply":"2022-01-01T11:51:59.222078Z"},"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"(0.7249664284333042, (1800,), (1800, 9))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameter Tuning + Feature Selection","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA ###\n\nmodel = BoostBoruta(clf_xgb, param_grid=param_grid, max_iter=200, perc=100)\nmodel.fit(X_clf_train, 
y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T11:51:59.224176Z","iopub.execute_input":"2022-01-01T11:51:59.224707Z","iopub.status.idle":"2022-01-01T12:14:09.045290Z","shell.execute_reply.started":"2022-01-01T11:51:59.224667Z","shell.execute_reply":"2022-01-01T12:14:09.044649Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00026 ### eval_score: 0.20001\ntrial: 0002 ### iterations: 00022 ### eval_score: 0.20348\ntrial: 0003 ### iterations: 00026 ### eval_score: 0.20001\ntrial: 0004 ### iterations: 00022 ### eval_score: 0.20348\ntrial: 0005 ### iterations: 00048 ### eval_score: 0.19925\ntrial: 0006 ### iterations: 00052 ### eval_score: 0.20307\ntrial: 0007 ### iterations: 00048 ### eval_score: 0.19925\ntrial: 0008 ### iterations: 00052 ### eval_score: 0.20307\n","output_type":"stream"},{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.046490Z","iopub.execute_input":"2022-01-01T12:14:09.047104Z","iopub.status.idle":"2022-01-01T12:14:09.056559Z","shell.execute_reply.started":"2022-01-01T12:14:09.047070Z","shell.execute_reply":"2022-01-01T12:14:09.056076Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1, max_delta_step=0,\n max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1, 'num_leaves': 25, 'max_depth': 10},\n 0.199248,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.057462Z","iopub.execute_input":"2022-01-01T12:14:09.057740Z","iopub.status.idle":"2022-01-01T12:14:09.086612Z","shell.execute_reply.started":"2022-01-01T12:14:09.057716Z","shell.execute_reply":"2022-01-01T12:14:09.085920Z"},"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"(0.9144444444444444, (1800,), (1800, 11), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) ###\n\nmodel = BoostRFE(\n regr_xgb, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, 
sampling_seed=0\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:14:09.087595Z","iopub.execute_input":"2022-01-01T12:14:09.087798Z","iopub.status.idle":"2022-01-01T12:16:42.203604Z","shell.execute_reply.started":"2022-01-01T12:14:09.087772Z","shell.execute_reply":"2022-01-01T12:16:42.202743Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.25941\ntrial: 0002 ### iterations: 00077 ### eval_score: 0.25055\ntrial: 0003 ### iterations: 00086 ### eval_score: 0.25676\ntrial: 0004 ### iterations: 00098 ### eval_score: 0.25383\ntrial: 0005 ### iterations: 00050 ### eval_score: 0.25751\ntrial: 0006 ### iterations: 00028 ### eval_score: 0.26007\ntrial: 0007 ### iterations: 00084 ### eval_score: 0.2603\ntrial: 0008 ### iterations: 00024 ### eval_score: 0.26278\n","output_type":"stream"},{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.205176Z","iopub.execute_input":"2022-01-01T12:16:42.205439Z","iopub.status.idle":"2022-01-01T12:16:42.215355Z","shell.execute_reply.started":"2022-01-01T12:16:42.205404Z","shell.execute_reply":"2022-01-01T12:16:42.214732Z"},"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1350674222191923,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=38, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1350674222191923, 'num_leaves': 38, 'max_depth': 10},\n 0.250552,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.216398Z","iopub.execute_input":"2022-01-01T12:16:42.216642Z","iopub.status.idle":"2022-01-01T12:16:42.242381Z","shell.execute_reply.started":"2022-01-01T12:16:42.216606Z","shell.execute_reply":"2022-01-01T12:16:42.241879Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"(0.7488873349293266, (1800,), (1800, 10))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) ###\n\nmodel = BoostRFA(\n regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0\n)\nmodel.fit(\n X_regr_train, 
y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:16:42.245655Z","iopub.execute_input":"2022-01-01T12:16:42.247219Z","iopub.status.idle":"2022-01-01T12:26:08.685124Z","shell.execute_reply.started":"2022-01-01T12:16:42.247188Z","shell.execute_reply":"2022-01-01T12:26:08.684364Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.26412\ntrial: 0002 ### iterations: 00080 ### eval_score: 0.25357\ntrial: 0003 ### iterations: 00054 ### eval_score: 0.26123\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.2801\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.27046\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.24789\ntrial: 0007 ### iterations: 00054 ### eval_score: 0.25928\ntrial: 0008 ### iterations: 00140 ### eval_score: 0.27284\n","output_type":"stream"},{"execution_count":37,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n random_state=0, reg_alpha=None, reg_lambda=None,\n scale_pos_weight=None, subsample=None,\n tree_method=None, validate_parameters=None,\n verbosity=0),\n min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, 
model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:26:08.686184Z","iopub.execute_input":"2022-01-01T12:26:08.686931Z","iopub.status.idle":"2022-01-01T12:26:08.696854Z","shell.execute_reply.started":"2022-01-01T12:26:08.686898Z","shell.execute_reply":"2022-01-01T12:26:08.696004Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.247887,\n 8)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:26:08.697934Z","iopub.execute_input":"2022-01-01T12:26:08.698155Z","iopub.status.idle":"2022-01-01T12:26:08.736781Z","shell.execute_reply.started":"2022-01-01T12:26:08.698128Z","shell.execute_reply":"2022-01-01T12:26:08.736145Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":"(0.7542006308661441, (1800,), (1800, 8))"},"metadata":{}}]},{"cell_type":"markdown","source":"# Hyperparameters Tuning + Features Selection with SHAP","metadata":{}},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH GRID-SEARCH + BORUTA SHAP ###\n\nmodel = BoostBoruta(\n clf_xgb, 
param_grid=param_grid, max_iter=200, perc=100,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_clf_train, y_clf_train, eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"scrolled":true,"execution":{"iopub.status.busy":"2022-01-01T12:26:08.740222Z","iopub.execute_input":"2022-01-01T12:26:08.741848Z","iopub.status.idle":"2022-01-01T12:56:13.612807Z","shell.execute_reply.started":"2022-01-01T12:26:08.741813Z","shell.execute_reply":"2022-01-01T12:56:13.611991Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00024 ### eval_score: 0.20151\ntrial: 0002 ### iterations: 00020 ### eval_score: 0.20877\ntrial: 0003 ### iterations: 00024 ### eval_score: 0.20151\ntrial: 0004 ### iterations: 00020 ### eval_score: 0.20877\ntrial: 0005 ### iterations: 00048 ### eval_score: 0.20401\ntrial: 0006 ### iterations: 00048 ### eval_score: 0.20575\ntrial: 0007 ### iterations: 00048 ### eval_score: 0.20401\ntrial: 0008 ### iterations: 00048 ### eval_score: 0.20575\n","output_type":"stream"},{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"BoostBoruta(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None,\n colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n importance_type='shap_importances', max_iter=200,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]},\n 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.617168Z","iopub.execute_input":"2022-01-01T12:56:13.617372Z","iopub.status.idle":"2022-01-01T12:56:13.626563Z","shell.execute_reply.started":"2022-01-01T12:56:13.617349Z","shell.execute_reply":"2022-01-01T12:56:13.626036Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"(XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.2, max_delta_step=0,\n max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.2, 'num_leaves': 25, 'max_depth': 10},\n 0.201509,\n 10)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_clf_valid, y_clf_valid), \n model.predict(X_clf_valid).shape, \n model.transform(X_clf_valid).shape,\n model.predict_proba(X_clf_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.627454Z","iopub.execute_input":"2022-01-01T12:56:13.627825Z","iopub.status.idle":"2022-01-01T12:56:13.665907Z","shell.execute_reply.started":"2022-01-01T12:56:13.627797Z","shell.execute_reply":"2022-01-01T12:56:13.664686Z"},"trusted":true},"execution_count":42,"outputs":[{"execution_count":42,"output_type":"execute_result","data":{"text/plain":"(0.9144444444444444, (1800,), (1800, 10), (1800, 2))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH RANDOM-SEARCH + RECURSIVE FEATURE ELIMINATION (RFE) SHAP 
###\n\nmodel = BoostRFE(\n regr_xgb, param_grid=param_dist, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(X_regr_train, y_regr_train, eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T12:56:13.667149Z","iopub.execute_input":"2022-01-01T12:56:13.667539Z","iopub.status.idle":"2022-01-01T13:08:38.854835Z","shell.execute_reply.started":"2022-01-01T12:56:13.667509Z","shell.execute_reply":"2022-01-01T13:08:38.854142Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00021 ### eval_score: 0.25941\ntrial: 0002 ### iterations: 00064 ### eval_score: 0.25075\ntrial: 0003 ### iterations: 00075 ### eval_score: 0.25493\ntrial: 0004 ### iterations: 00084 ### eval_score: 0.25002\ntrial: 0005 ### iterations: 00093 ### eval_score: 0.25609\ntrial: 0006 ### iterations: 00039 ### eval_score: 0.2573\ntrial: 0007 ### iterations: 00074 ### eval_score: 0.25348\ntrial: 0008 ### iterations: 00032 ### eval_score: 0.2583\n","output_type":"stream"},{"execution_count":43,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'learning_rate': ,\n 'max_depth': [10, 12],\n 'num_leaves': },\n sampling_seed=0, 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.855807Z","iopub.execute_input":"2022-01-01T13:08:38.856007Z","iopub.status.idle":"2022-01-01T13:08:38.866421Z","shell.execute_reply.started":"2022-01-01T13:08:38.855982Z","shell.execute_reply":"2022-01-01T13:08:38.865771Z"},"trusted":true},"execution_count":44,"outputs":[{"execution_count":44,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=1, enable_categorical=False,\n gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.1669837381562427,\n max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_leaves=25, num_parallel_tree=1, predictor='auto',\n random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n subsample=1, tree_method='exact', validate_parameters=1,\n verbosity=0),\n {'learning_rate': 0.1669837381562427, 'num_leaves': 25, 'max_depth': 10},\n 0.250021,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.867249Z","iopub.execute_input":"2022-01-01T13:08:38.867888Z","iopub.status.idle":"2022-01-01T13:08:38.887178Z","shell.execute_reply.started":"2022-01-01T13:08:38.867860Z","shell.execute_reply":"2022-01-01T13:08:38.886666Z"},"trusted":true},"execution_count":45,"outputs":[{"execution_count":45,"output_type":"execute_result","data":{"text/plain":"(0.7499501426259738, (1800,), (1800, 11))"},"metadata":{}}]},{"cell_type":"code","source":"### HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) SHAP ###\n\nmodel = BoostRFA(\n 
regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,\n n_iter=8, sampling_seed=0,\n importance_type='shap_importances', train_importance=False\n)\nmodel.fit(\n X_regr_train, y_regr_train, trials=Trials(), \n eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:08:38.890197Z","iopub.execute_input":"2022-01-01T13:08:38.891876Z","iopub.status.idle":"2022-01-01T13:41:32.886109Z","shell.execute_reply.started":"2022-01-01T13:08:38.891845Z","shell.execute_reply":"2022-01-01T13:41:32.885257Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('max_depth', 'learning_rate', 'colsample_bytree')\n\ntrial: 0001 ### iterations: 00149 ### eval_score: 0.25811\ntrial: 0002 ### iterations: 00078 ### eval_score: 0.25554\ntrial: 0003 ### iterations: 00059 ### eval_score: 0.26658\ntrial: 0004 ### iterations: 00149 ### eval_score: 0.27356\ntrial: 0005 ### iterations: 00149 ### eval_score: 0.26426\ntrial: 0006 ### iterations: 00149 ### eval_score: 0.25537\ntrial: 0007 ### iterations: 00052 ### eval_score: 0.26107\ntrial: 0008 ### iterations: 00137 ### eval_score: 0.27787\n","output_type":"stream"},{"execution_count":46,"output_type":"execute_result","data":{"text/plain":"BoostRFA(estimator=XGBRegressor(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None, enable_categorical=False,\n gamma=None, gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimato...\n tree_method=None, validate_parameters=None,\n verbosity=0),\n importance_type='shap_importances', min_features_to_select=1, n_iter=8,\n param_grid={'colsample_bytree': ,\n 'learning_rate': ,\n 'max_depth': },\n sampling_seed=0, 
train_importance=False)"},"metadata":{}}]},{"cell_type":"code","source":"model.estimator_, model.best_params_, model.best_score_, model.n_features_","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.887300Z","iopub.execute_input":"2022-01-01T13:41:32.887495Z","iopub.status.idle":"2022-01-01T13:41:32.897203Z","shell.execute_reply.started":"2022-01-01T13:41:32.887472Z","shell.execute_reply":"2022-01-01T13:41:32.896455Z"},"trusted":true},"execution_count":47,"outputs":[{"execution_count":47,"output_type":"execute_result","data":{"text/plain":"(XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bynode=1, colsample_bytree=0.7597292534356749,\n enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None,\n interaction_constraints='', learning_rate=0.059836658149176665,\n max_delta_step=0, max_depth=16, min_child_weight=1, missing=nan,\n monotone_constraints='()', n_estimators=150, n_jobs=-1,\n num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,\n reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',\n validate_parameters=1, verbosity=0),\n {'colsample_bytree': 0.7597292534356749,\n 'learning_rate': 0.059836658149176665,\n 'max_depth': 16},\n 0.255374,\n 11)"},"metadata":{}}]},{"cell_type":"code","source":"(model.score(X_regr_valid, y_regr_valid), \n model.predict(X_regr_valid).shape, \n model.transform(X_regr_valid).shape)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.898201Z","iopub.execute_input":"2022-01-01T13:41:32.898493Z","iopub.status.idle":"2022-01-01T13:41:32.931801Z","shell.execute_reply.started":"2022-01-01T13:41:32.898469Z","shell.execute_reply":"2022-01-01T13:41:32.931131Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"(0.7391290836488575, (1800,), (1800, 11))"},"metadata":{}}]},{"cell_type":"markdown","source":"# CUSTOM EVAL METRIC 
SUPPORT","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import roc_auc_score\n\ndef AUC(y_hat, dtrain):\n y_true = dtrain.get_label()\n return 'auc', roc_auc_score(y_true, y_hat)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.932773Z","iopub.execute_input":"2022-01-01T13:41:32.932979Z","iopub.status.idle":"2022-01-01T13:41:32.940277Z","shell.execute_reply.started":"2022-01-01T13:41:32.932952Z","shell.execute_reply":"2022-01-01T13:41:32.939659Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"code","source":"model = BoostRFE(\n clf_xgb, \n param_grid=param_grid, min_features_to_select=1, step=1,\n greater_is_better=True\n)\nmodel.fit(\n X_clf_train, y_clf_train, \n eval_set=[(X_clf_valid, y_clf_valid)], early_stopping_rounds=6, verbose=0,\n eval_metric=AUC\n)","metadata":{"execution":{"iopub.status.busy":"2022-01-01T13:41:32.943194Z","iopub.execute_input":"2022-01-01T13:41:32.944797Z","iopub.status.idle":"2022-01-01T13:43:50.574377Z","shell.execute_reply.started":"2022-01-01T13:41:32.944765Z","shell.execute_reply":"2022-01-01T13:43:50.573628Z"},"trusted":true},"execution_count":50,"outputs":[{"name":"stdout","text":"\n8 trials detected for ('learning_rate', 'num_leaves', 'max_depth')\n\ntrial: 0001 ### iterations: 00017 ### eval_score: 0.9757\ntrial: 0002 ### iterations: 00026 ### eval_score: 0.97632\ntrial: 0003 ### iterations: 00017 ### eval_score: 0.9757\ntrial: 0004 ### iterations: 00026 ### eval_score: 0.97632\ntrial: 0005 ### iterations: 00033 ### eval_score: 0.97594\ntrial: 0006 ### iterations: 00034 ### eval_score: 0.97577\ntrial: 0007 ### iterations: 00033 ### eval_score: 0.97594\ntrial: 0008 ### iterations: 00034 ### eval_score: 0.97577\n","output_type":"stream"},{"execution_count":50,"output_type":"execute_result","data":{"text/plain":"BoostRFE(estimator=XGBClassifier(base_score=None, booster=None,\n colsample_bylevel=None, colsample_bynode=None,\n colsample_bytree=None,\n enable_categorical=False, 
gamma=None,\n gpu_id=None, importance_type=None,\n interaction_constraints=None,\n learning_rate=None, max_delta_step=None,\n max_depth=None, min_child_weight=None,\n missing=nan, monotone_constraints=None,\n n_estimators=150, n_jobs=-1,\n num_parallel_tree=None, predictor=None,\n random_state=0, reg_alpha=None,\n reg_lambda=None, scale_pos_weight=None,\n subsample=None, tree_method=None,\n validate_parameters=None, verbosity=0),\n greater_is_better=True, min_features_to_select=1,\n param_grid={'learning_rate': [0.2, 0.1], 'max_depth': [10, 12],\n 'num_leaves': [25, 35]})"},"metadata":{}}]}]} ================================================ FILE: requirements.txt ================================================ numpy scipy scikit-learn>=0.24.1 shap>=0.39.0 hyperopt==0.2.5 ================================================ FILE: setup.py ================================================ import pathlib from setuptools import setup, find_packages HERE = pathlib.Path(__file__).parent VERSION = '0.2.7' PACKAGE_NAME = 'shap-hypetune' AUTHOR = 'Marco Cerliani' AUTHOR_EMAIL = 'cerlymarco@gmail.com' URL = 'https://github.com/cerlymarco/shap-hypetune' LICENSE = 'MIT' DESCRIPTION = 'A python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models.' 
LONG_DESCRIPTION = (HERE / "README.md").read_text()
LONG_DESC_TYPE = "text/markdown"

INSTALL_REQUIRES = [
    'numpy',
    'scipy',
    'scikit-learn>=0.24.1',
    'shap>=0.39.0',
    'hyperopt==0.2.5'
]

setup(name=PACKAGE_NAME,
      version=VERSION,
      description=DESCRIPTION,
      long_description=LONG_DESCRIPTION,
      long_description_content_type=LONG_DESC_TYPE,
      author=AUTHOR,
      license=LICENSE,
      author_email=AUTHOR_EMAIL,
      url=URL,
      install_requires=INSTALL_REQUIRES,
      python_requires='>=3',
      packages=find_packages()
      )

================================================
FILE: shaphypetune/__init__.py
================================================
from .utils import *
from ._classes import *
from .shaphypetune import *

================================================
FILE: shaphypetune/_classes.py
================================================
import io
import contextlib
import warnings

import numpy as np
import scipy as sp
from copy import deepcopy

from sklearn.base import clone
from sklearn.utils.validation import check_is_fitted
from sklearn.base import BaseEstimator, TransformerMixin

from joblib import Parallel, delayed
from hyperopt import fmin, tpe

from .utils import ParameterSampler, _check_param, _check_boosting
from .utils import _set_categorical_indexes, _get_categorical_support
from .utils import _feature_importances, _shap_importances


class _BoostSearch(BaseEstimator):
    """Base class for BoostSearch meta-estimator.

    Warning: This class should not be used directly. Use derived classes
    instead.
    """

    def __init__(self):
        pass

    def _validate_param_grid(self, fit_params):
        """Private method to validate fitting parameters."""

        if not isinstance(self.param_grid, dict):
            raise ValueError("Pass param_grid in dict format.")
        self._param_grid = self.param_grid.copy()

        for p_k, p_v in self._param_grid.items():
            self._param_grid[p_k] = _check_param(p_v)

        if 'eval_set' not in fit_params:
            raise ValueError(
                "When tuning parameters, at least "
                "an evaluation set is required.")

        self._eval_score = np.argmax if self.greater_is_better else np.argmin
        self._score_sign = -1 if self.greater_is_better else 1

        rs = ParameterSampler(
            n_iter=self.n_iter,
            param_distributions=self._param_grid,
            random_state=self.sampling_seed
        )
        self._param_combi, self._tuning_type = rs.sample()
        self._trial_id = 1

        if self.verbose > 0:
            n_trials = self.n_iter if self._tuning_type == 'hyperopt' \
                else len(self._param_combi)
            print("\n{} trials detected for {}\n".format(
                n_trials, tuple(self.param_grid.keys())))

    def _fit(self, X, y, fit_params, params=None):
        """Private method to fit a single boosting model and extract results."""

        model = self._build_model(params)
        if isinstance(model, _BoostSelector):
            model.fit(X=X, y=y, **fit_params)
        else:
            with contextlib.redirect_stdout(io.StringIO()):
                model.fit(X=X, y=y, **fit_params)

        results = {'params': params, 'status': 'ok'}

        if isinstance(model, _BoostSelector):
            results['booster'] = model.estimator_
            results['model'] = model
        else:
            results['booster'] = model
            results['model'] = None

        if 'eval_set' not in fit_params:
            return results

        if self.boost_type_ == 'XGB':
            # w/ eval_set and w/ early_stopping_rounds
            if hasattr(results['booster'], 'best_score'):
                results['iterations'] = results['booster'].best_iteration
            # w/ eval_set and w/o early_stopping_rounds
            else:
                valid_id = list(results['booster'].evals_result_.keys())[-1]
                eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
                results['iterations'] = \
                    len(results['booster'].evals_result_[valid_id][eval_metric])
        else:
            # w/ eval_set and w/ early_stopping_rounds
            if results['booster'].best_iteration_ is not None:
                results['iterations'] = results['booster'].best_iteration_
            # w/ eval_set and w/o early_stopping_rounds
            else:
                valid_id = list(results['booster'].evals_result_.keys())[-1]
                eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
                results['iterations'] = \
                    len(results['booster'].evals_result_[valid_id][eval_metric])

        if self.boost_type_ == 'XGB':
            # w/ eval_set and w/ early_stopping_rounds
            if hasattr(results['booster'], 'best_score'):
                results['loss'] = results['booster'].best_score
            # w/ eval_set and w/o early_stopping_rounds
            else:
                valid_id = list(results['booster'].evals_result_.keys())[-1]
                eval_metric = list(results['booster'].evals_result_[valid_id])[-1]
                results['loss'] = \
                    results['booster'].evals_result_[valid_id][eval_metric][-1]
        else:
            valid_id = list(results['booster'].best_score_.keys())[-1]
            eval_metric = list(results['booster'].best_score_[valid_id])[-1]
            results['loss'] = results['booster'].best_score_[valid_id][eval_metric]

        if params is not None:
            if self.verbose > 0:
                msg = "trial: {} ### iterations: {} ### eval_score: {}".format(
                    str(self._trial_id).zfill(4),
                    str(results['iterations']).zfill(5),
                    round(results['loss'], 5)
                )
                print(msg)
            self._trial_id += 1
            results['loss'] *= self._score_sign

        return results

    def fit(self, X, y, trials=None, **fit_params):
        """Fit the provided boosting algorithm while searching the best subset
        of features (according to the selected strategy) and choosing the best
        parameters configuration (if provided).

        It takes the same arguments available in the estimator fit.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The training input samples.

        y : array-like of shape (n_samples,)
            Target values.

        trials : hyperopt.Trials() object, default=None
            A hyperopt trials object, used to store intermediate results for
            all optimization runs. Effective (and required) only when hyperopt
            parameter searching is computed.

        **fit_params : Additional fitting arguments.

        Returns
        -------
        self : object
        """
        self.boost_type_ = _check_boosting(self.estimator)

        if self.param_grid is None:
            results = self._fit(X, y, fit_params)

            for v in vars(results['model']):
                if v.endswith("_") and not v.startswith("__"):
                    setattr(self, str(v), getattr(results['model'], str(v)))
        else:
            self._validate_param_grid(fit_params)

            if self._tuning_type == 'hyperopt':
                if trials is None:
                    raise ValueError(
                        "trials must not be None when using hyperopt."
                    )

                search = fmin(
                    fn=lambda p: self._fit(
                        params=p, X=X, y=y, fit_params=fit_params
                    ),
                    space=self._param_combi, algo=tpe.suggest,
                    max_evals=self.n_iter, trials=trials,
                    rstate=np.random.RandomState(self.sampling_seed),
                    show_progressbar=False, verbose=0
                )
                all_results = trials.results
            else:
                all_results = Parallel(
                    n_jobs=self.n_jobs,
                    verbose=self.verbose * int(bool(self.n_jobs))
                )(delayed(self._fit)(X, y, fit_params, params)
                  for params in self._param_combi)

            # extract results from parallel loops
            self.trials_, self.iterations_, self.scores_, models = [], [], [], []
            for job_res in all_results:
                self.trials_.append(job_res['params'])
                self.iterations_.append(job_res['iterations'])
                self.scores_.append(self._score_sign * job_res['loss'])
                if isinstance(job_res['model'], _BoostSelector):
                    models.append(job_res['model'])
                else:
                    models.append(job_res['booster'])

            # get the best
            id_best = self._eval_score(self.scores_)
            self.best_params_ = self.trials_[id_best]
            self.best_iter_ = self.iterations_[id_best]
            self.best_score_ = self.scores_[id_best]
            self.estimator_ = models[id_best]

            for v in vars(models[id_best]):
                if v.endswith("_") and not v.startswith("__"):
                    setattr(self, str(v), getattr(models[id_best], str(v)))

        return self

    def predict(self, X, **predict_params):
        """Predict X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Samples.

        **predict_params : Additional predict arguments.

        Returns
        -------
        pred : ndarray of shape (n_samples,)
            The predicted values.
""" check_is_fitted(self) if hasattr(self, 'transform'): X = self.transform(X) return self.estimator_.predict(X, **predict_params) def predict_proba(self, X, **predict_params): """Predict X probabilities. Parameters ---------- X : array-like of shape (n_samples, n_features) Samples. **predict_params : Additional predict arguments. Returns ------- pred : ndarray of shape (n_samples, n_classes) The predicted values. """ check_is_fitted(self) # raise original AttributeError getattr(self.estimator_, 'predict_proba') if hasattr(self, 'transform'): X = self.transform(X) return self.estimator_.predict_proba(X, **predict_params) def score(self, X, y, sample_weight=None): """Return the score on the given test data and labels. Parameters ---------- X : array-like of shape (n_samples, n_features) Test samples. y : array-like of shape (n_samples,) True values for X. sample_weight : array-like of shape (n_samples,), default=None Sample weights. Returns ------- score : float Accuracy for classification, R2 for regression. """ check_is_fitted(self) if hasattr(self, 'transform'): X = self.transform(X) return self.estimator_.score(X, y, sample_weight=sample_weight) class _BoostSelector(BaseEstimator, TransformerMixin): """Base class for feature selection meta-estimator. Warning: This class should not be used directly. Use derived classes instead. """ def __init__(self): pass def transform(self, X): """Reduces the input X to the features selected by Boruta. Parameters ---------- X : array-like of shape (n_samples, n_features) Samples. Returns ------- X : array-like of shape (n_samples, n_features_) The input samples with only the selected features by Boruta. 
""" check_is_fitted(self) shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") if shapes[1] != self.support_.shape[0]: raise ValueError( "Expected {} features, received {}.".format( self.support_.shape[0], shapes[1])) if isinstance(X, np.ndarray): return X[:, self.support_] elif hasattr(X, 'loc'): return X.loc[:, self.support_] else: raise ValueError("Data type not understood.") def get_support(self, indices=False): """Get a mask, or integer index, of the features selected. Parameters ---------- indices : bool, default=False If True, the return value will be an array of integers, rather than a boolean mask. Returns ------- support : array An index that selects the retained features from a feature vector. If `indices` is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If `indices` is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector. """ check_is_fitted(self) mask = self.support_ return mask if not indices else np.where(mask)[0] class _Boruta(_BoostSelector): """Base class for BoostBoruta meta-estimator. Warning: This class should not be used directly. Use derived classes instead. Notes ----- The code for the Boruta algorithm is inspired and improved from: https://github.com/scikit-learn-contrib/boruta_py """ def __init__(self, estimator, *, perc=100, alpha=0.05, max_iter=100, early_stopping_boruta_rounds=None, importance_type='feature_importances', train_importance=True, verbose=0): self.estimator = estimator self.perc = perc self.alpha = alpha self.max_iter = max_iter self.early_stopping_boruta_rounds = early_stopping_boruta_rounds self.importance_type = importance_type self.train_importance = train_importance self.verbose = verbose def _create_X(self, X, feat_id_real): """Private method to add shadow features to the original ones. 
""" if isinstance(X, np.ndarray): X_real = X[:, feat_id_real].copy() X_sha = X_real.copy() X_sha = np.apply_along_axis(self._random_state.permutation, 0, X_sha) X = np.hstack((X_real, X_sha)) elif hasattr(X, 'iloc'): X_real = X.iloc[:, feat_id_real].copy() X_sha = X_real.copy() X_sha = X_sha.apply(self._random_state.permutation) X_sha = X_sha.astype(X_real.dtypes) X = X_real.join(X_sha, rsuffix='_SHA') else: raise ValueError("Data type not understood.") return X def _check_fit_params(self, fit_params, feat_id_real=None): """Private method to validate and check fit_params.""" _fit_params = deepcopy(fit_params) estimator = clone(self.estimator) # add here possible estimator checks in each iteration _fit_params = _set_categorical_indexes( self.support_, self._cat_support, _fit_params, duplicate=True) if feat_id_real is None: # final model fit if 'eval_set' in _fit_params: _fit_params['eval_set'] = list(map(lambda x: ( self.transform(x[0]), x[1] ), _fit_params['eval_set'])) else: if 'eval_set' in _fit_params: # iterative model fit _fit_params['eval_set'] = list(map(lambda x: ( self._create_X(x[0], feat_id_real), x[1] ), _fit_params['eval_set'])) if 'feature_name' in _fit_params: # LGB _fit_params['feature_name'] = 'auto' if 'feature_weights' in _fit_params: # XGB import warnings warnings.warn( "feature_weights is not supported when selecting features. 
" "It's automatically set to None.") _fit_params['feature_weights'] = None return _fit_params, estimator def _do_tests(self, dec_reg, hit_reg, iter_id): """Private method to operate Bonferroni corrections on the feature selections.""" active_features = np.where(dec_reg >= 0)[0] hits = hit_reg[active_features] # get uncorrected p values based on hit_reg to_accept_ps = sp.stats.binom.sf(hits - 1, iter_id, .5).flatten() to_reject_ps = sp.stats.binom.cdf(hits, iter_id, .5).flatten() # Bonferroni correction with the total n_features in each iteration to_accept = to_accept_ps <= self.alpha / float(len(dec_reg)) to_reject = to_reject_ps <= self.alpha / float(len(dec_reg)) # find features which are 0 and have been rejected or accepted to_accept = np.where((dec_reg[active_features] == 0) * to_accept)[0] to_reject = np.where((dec_reg[active_features] == 0) * to_reject)[0] # updating dec_reg dec_reg[active_features[to_accept]] = 1 dec_reg[active_features[to_reject]] = -1 return dec_reg def fit(self, X, y, **fit_params): """Fit the Boruta algorithm to automatically tune the number of selected features.""" self.boost_type_ = _check_boosting(self.estimator) if self.max_iter < 1: raise ValueError('max_iter should be an integer >0.') if self.perc <= 0 or self.perc > 100: raise ValueError('The percentile should be between 0 and 100.') if self.alpha <= 0 or self.alpha > 1: raise ValueError('alpha should be between 0 and 1.') if self.early_stopping_boruta_rounds is None: es_boruta_rounds = self.max_iter else: if self.early_stopping_boruta_rounds < 1: raise ValueError( 'early_stopping_boruta_rounds should be an integer >0.') es_boruta_rounds = self.early_stopping_boruta_rounds importances = ['feature_importances', 'shap_importances'] if self.importance_type not in importances: raise ValueError( "importance_type must be one of {}. 
Got '{}'".format( importances, self.importance_type)) if self.importance_type == 'shap_importances': if not self.train_importance and 'eval_set' not in fit_params: raise ValueError( "When train_importance is set to False, using " "shap_importances, pass at least an eval_set.") eval_importance = not self.train_importance and 'eval_set' in fit_params shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") n_features = shapes[1] # create mask for user-defined categorical features self._cat_support = _get_categorical_support(n_features, fit_params) # holds the decision about each feature: # default (0); accepted (1); rejected (-1) dec_reg = np.zeros(n_features, dtype=int) dec_history = np.zeros((self.max_iter, n_features), dtype=int) # counts how many times a given feature was more important than # the best of the shadow features hit_reg = np.zeros(n_features, dtype=int) # record the history of the iterations imp_history = np.zeros(n_features, dtype=float) sha_max_history = [] for i in range(self.max_iter): if (dec_reg != 0).all(): if self.verbose > 1: print("All features analyzed. 
Boruta stop!") break if self.verbose > 1: print('Iteration: {} / {}'.format(i + 1, self.max_iter)) self._random_state = np.random.RandomState(i + 1000) # add shadow attributes, shuffle and train estimator self.support_ = dec_reg >= 0 feat_id_real = np.where(self.support_)[0] n_real = feat_id_real.shape[0] _fit_params, estimator = self._check_fit_params(fit_params, feat_id_real) estimator.set_params(random_state=i + 1000) _X = self._create_X(X, feat_id_real) with contextlib.redirect_stdout(io.StringIO()): estimator.fit(_X, y, **_fit_params) # get coefs if self.importance_type == 'feature_importances': coefs = _feature_importances(estimator) else: if eval_importance: coefs = _shap_importances( estimator, _fit_params['eval_set'][-1][0]) else: coefs = _shap_importances(estimator, _X) # separate importances of real and shadow features imp_sha = coefs[n_real:] imp_real = np.zeros(n_features) * np.nan imp_real[feat_id_real] = coefs[:n_real] # get the threshold of shadow importances used for rejection imp_sha_max = np.percentile(imp_sha, self.perc) # record importance history sha_max_history.append(imp_sha_max) imp_history = np.vstack((imp_history, imp_real)) # register which feature is more imp than the max of shadows hit_reg[np.where(imp_real[~np.isnan(imp_real)] > imp_sha_max)[0]] += 1 # check if a feature is doing better than expected by chance dec_reg = self._do_tests(dec_reg, hit_reg, i + 1) dec_history[i] = dec_reg es_id = i - es_boruta_rounds if es_id >= 0: if np.equal(dec_history[es_id:(i + 1)], dec_reg).all(): if self.verbose > 0: print("Boruta early stopping at iteration {}".format(i + 1)) break confirmed = np.where(dec_reg == 1)[0] tentative = np.where(dec_reg == 0)[0] self.support_ = np.zeros(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) * 4 self.n_features_ = confirmed.shape[0] self.importance_history_ = imp_history[1:] if tentative.shape[0] > 0: tentative_median = np.nanmedian(imp_history[1:, tentative], axis=0) tentative_low = 
tentative[ np.where(tentative_median <= np.median(sha_max_history))[0]] tentative_up = np.setdiff1d(tentative, tentative_low) self.ranking_[tentative_low] = 3 if tentative_up.shape[0] > 0: self.ranking_[tentative_up] = 2 if confirmed.shape[0] > 0: self.support_[confirmed] = True self.ranking_[confirmed] = 1 if (~self.support_).all(): raise RuntimeError( "Boruta didn't select any feature. Try to increase max_iter or " "increase (if not None) early_stopping_boruta_rounds or " "decrease perc.") _fit_params, self.estimator_ = self._check_fit_params(fit_params) with contextlib.redirect_stdout(io.StringIO()): self.estimator_.fit(self.transform(X), y, **_fit_params) return self class _RFE(_BoostSelector): """Base class for BoostRFE meta-estimator. Warning: This class should not be used directly. Use derived classes instead. """ def __init__(self, estimator, *, min_features_to_select=None, step=1, greater_is_better=False, importance_type='feature_importances', train_importance=True, verbose=0): self.estimator = estimator self.min_features_to_select = min_features_to_select self.step = step self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.verbose = verbose def _check_fit_params(self, fit_params): """Private method to validate and check fit_params.""" _fit_params = deepcopy(fit_params) estimator = clone(self.estimator) # add here possible estimator checks in each iteration _fit_params = _set_categorical_indexes( self.support_, self._cat_support, _fit_params) if 'eval_set' in _fit_params: _fit_params['eval_set'] = list(map(lambda x: ( self.transform(x[0]), x[1] ), _fit_params['eval_set'])) if 'feature_name' in _fit_params: # LGB _fit_params['feature_name'] = 'auto' if 'feature_weights' in _fit_params: # XGB import warnings warnings.warn( "feature_weights is not supported when selecting features. 
" "It's automatically set to None.") _fit_params['feature_weights'] = None return _fit_params, estimator def _step_score(self, estimator): """Return the score for a fit on eval_set.""" if self.boost_type_ == 'LGB': valid_id = list(estimator.best_score_.keys())[-1] eval_metric = list(estimator.best_score_[valid_id])[-1] score = estimator.best_score_[valid_id][eval_metric] else: # w/ eval_set and w/ early_stopping_rounds if hasattr(estimator, 'best_score'): score = estimator.best_score # w/ eval_set and w/o early_stopping_rounds else: valid_id = list(estimator.evals_result_.keys())[-1] eval_metric = list(estimator.evals_result_[valid_id])[-1] score = estimator.evals_result_[valid_id][eval_metric][-1] return score def fit(self, X, y, **fit_params): """Fit the RFE algorithm to automatically tune the number of selected features.""" self.boost_type_ = _check_boosting(self.estimator) importances = ['feature_importances', 'shap_importances'] if self.importance_type not in importances: raise ValueError( "importance_type must be one of {}. 
Got '{}'".format( importances, self.importance_type)) # scoring controls the calculation of self.score_history_ # scoring is used automatically when 'eval_set' is in fit_params scoring = 'eval_set' in fit_params if self.importance_type == 'shap_importances': if not self.train_importance and not scoring: raise ValueError( "When train_importance is set to False, using " "shap_importances, pass at least an eval_set.") eval_importance = not self.train_importance and scoring shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") n_features = shapes[1] # create mask for user-defined categorical features self._cat_support = _get_categorical_support(n_features, fit_params) if self.min_features_to_select is None: if scoring: min_features_to_select = 1 else: min_features_to_select = n_features // 2 else: min_features_to_select = self.min_features_to_select if 0.0 < self.step < 1.0: step = int(max(1, self.step * n_features)) else: step = int(self.step) if step <= 0: raise ValueError("Step must be >0.") self.support_ = np.ones(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) if scoring: self.score_history_ = [] eval_score = np.max if self.greater_is_better else np.min best_score = -np.inf if self.greater_is_better else np.inf while np.sum(self.support_) > min_features_to_select: # remaining features features = np.arange(n_features)[self.support_] _fit_params, estimator = self._check_fit_params(fit_params) if self.verbose > 1: print("Fitting estimator with {} features".format( self.support_.sum())) with contextlib.redirect_stdout(io.StringIO()): estimator.fit(self.transform(X), y, **_fit_params) # get coefs if self.importance_type == 'feature_importances': coefs = _feature_importances(estimator) else: if eval_importance: coefs = _shap_importances( estimator, _fit_params['eval_set'][-1][0]) else: coefs = _shap_importances( estimator, self.transform(X)) ranks = np.argsort(coefs) # eliminate the worst features threshold = min(step, 
np.sum(self.support_) - min_features_to_select) # compute step score on the previous selection iteration # because 'estimator' must use features # that have not been eliminated yet if scoring: score = self._step_score(estimator) self.score_history_.append(score) if best_score != eval_score([score, best_score]): best_score = score best_support = self.support_.copy() best_ranking = self.ranking_.copy() best_estimator = estimator self.support_[features[ranks][:threshold]] = False self.ranking_[np.logical_not(self.support_)] += 1 # set final attributes _fit_params, self.estimator_ = self._check_fit_params(fit_params) if self.verbose > 1: print("Fitting estimator with {} features".format(self.support_.sum())) with contextlib.redirect_stdout(io.StringIO()): self.estimator_.fit(self.transform(X), y, **_fit_params) # compute step score when only min_features_to_select features left if scoring: score = self._step_score(self.estimator_) self.score_history_.append(score) if best_score == eval_score([score, best_score]): self.support_ = best_support self.ranking_ = best_ranking self.estimator_ = best_estimator self.n_features_ = self.support_.sum() return self class _RFA(_BoostSelector): """Base class for BoostRFA meta-estimator. Warning: This class should not be used directly. Use derived classes instead. 
""" def __init__(self, estimator, *, min_features_to_select=None, step=1, greater_is_better=False, importance_type='feature_importances', train_importance=True, verbose=0): self.estimator = estimator self.min_features_to_select = min_features_to_select self.step = step self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.verbose = verbose def _check_fit_params(self, fit_params, inverse=False): """Private method to validate and check fit_params.""" _fit_params = deepcopy(fit_params) estimator = clone(self.estimator) # add here possible estimator checks in each iteration _fit_params = _set_categorical_indexes( self.support_, self._cat_support, _fit_params) if 'eval_set' in _fit_params: _fit_params['eval_set'] = list(map(lambda x: ( self._transform(x[0], inverse), x[1] ), _fit_params['eval_set'])) if 'feature_name' in _fit_params: # LGB _fit_params['feature_name'] = 'auto' if 'feature_weights' in _fit_params: # XGB import warnings warnings.warn( "feature_weights is not supported when selecting features. 
" "It's automatically set to None.") _fit_params['feature_weights'] = None return _fit_params, estimator def _step_score(self, estimator): """Return the score for a fit on eval_set.""" if self.boost_type_ == 'LGB': valid_id = list(estimator.best_score_.keys())[-1] eval_metric = list(estimator.best_score_[valid_id])[-1] score = estimator.best_score_[valid_id][eval_metric] else: # w/ eval_set and w/ early_stopping_rounds if hasattr(estimator, 'best_score'): score = estimator.best_score # w/ eval_set and w/o early_stopping_rounds else: valid_id = list(estimator.evals_result_.keys())[-1] eval_metric = list(estimator.evals_result_[valid_id])[-1] score = estimator.evals_result_[valid_id][eval_metric][-1] return score def fit(self, X, y, **fit_params): """Fit the RFA algorithm to automatically tune the number of selected features.""" self.boost_type_ = _check_boosting(self.estimator) importances = ['feature_importances', 'shap_importances'] if self.importance_type not in importances: raise ValueError( "importance_type must be one of {}. 
Got '{}'".format( importances, self.importance_type)) # scoring controls the calculation of self.score_history_ # scoring is used automatically when 'eval_set' is in fit_params scoring = 'eval_set' in fit_params if self.importance_type == 'shap_importances': if not self.train_importance and not scoring: raise ValueError( "When train_importance is set to False, using " "shap_importances, pass at least an eval_set.") eval_importance = not self.train_importance and scoring shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") n_features = shapes[1] # create mask for user-defined categorical features self._cat_support = _get_categorical_support(n_features, fit_params) if self.min_features_to_select is None: if scoring: min_features_to_select = 1 else: min_features_to_select = n_features // 2 else: if scoring: min_features_to_select = self.min_features_to_select else: min_features_to_select = n_features - self.min_features_to_select if 0.0 < self.step < 1.0: step = int(max(1, self.step * n_features)) else: step = int(self.step) if step <= 0: raise ValueError("Step must be >0.") self.support_ = np.zeros(n_features, dtype=bool) self._support = np.ones(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) self._ranking = np.ones(n_features, dtype=int) if scoring: self.score_history_ = [] eval_score = np.max if self.greater_is_better else np.min best_score = -np.inf if self.greater_is_better else np.inf while np.sum(self._support) > min_features_to_select: # remaining features features = np.arange(n_features)[self._support] # score the previously added features if scoring and np.sum(self.support_) > 0: _fit_params, estimator = self._check_fit_params(fit_params) with contextlib.redirect_stdout(io.StringIO()): estimator.fit(self._transform(X, inverse=False), y, **_fit_params) score = self._step_score(estimator) self.score_history_.append(score) if best_score != eval_score([score, best_score]): best_score = score best_support = 
self.support_.copy() best_ranking = self.ranking_.copy() best_estimator = estimator # evaluate the remaining features _fit_params, _estimator = self._check_fit_params(fit_params, inverse=True) if self.verbose > 1: print("Fitting estimator with {} features".format(self._support.sum())) with contextlib.redirect_stdout(io.StringIO()): _estimator.fit(self._transform(X, inverse=True), y, **_fit_params) if self._support.sum() == n_features: all_features_estimator = _estimator # get coefs if self.importance_type == 'feature_importances': coefs = _feature_importances(_estimator) else: if eval_importance: coefs = _shap_importances( _estimator, _fit_params['eval_set'][-1][0]) else: coefs = _shap_importances( _estimator, self._transform(X, inverse=True)) ranks = np.argsort(-coefs) # the rank is inverted # add the best features threshold = min(step, np.sum(self._support) - min_features_to_select) # remaining features to test self._support[features[ranks][:threshold]] = False self._ranking[np.logical_not(self._support)] += 1 # features tested self.support_[features[ranks][:threshold]] = True self.ranking_[np.logical_not(self.support_)] += 1 # set final attributes _fit_params, self.estimator_ = self._check_fit_params(fit_params) if self.verbose > 1: print("Fitting estimator with {} features".format(self._support.sum())) with contextlib.redirect_stdout(io.StringIO()): self.estimator_.fit(self._transform(X, inverse=False), y, **_fit_params) # compute step score when only min_features_to_select features left if scoring: score = self._step_score(self.estimator_) self.score_history_.append(score) if best_score == eval_score([score, best_score]): self.support_ = best_support self.ranking_ = best_ranking self.estimator_ = best_estimator if len(set(self.score_history_)) == 1: self.support_ = np.ones(n_features, dtype=bool) self.ranking_ = np.ones(n_features, dtype=int) self.estimator_ = all_features_estimator self.n_features_ = self.support_.sum() return self def _transform(self, X, 
inverse=False): """Private method to reduce the input X to the features selected.""" shapes = np.shape(X) if len(shapes) != 2: raise ValueError("X must be 2D.") if shapes[1] != self.support_.shape[0]: raise ValueError( "Expected {} features, received {}.".format( self.support_.shape[0], shapes[1])) if inverse: if isinstance(X, np.ndarray): return X[:, self._support] elif hasattr(X, 'loc'): return X.loc[:, self._support] elif sp.sparse.issparse(X): return X[:, self._support] else: raise ValueError("Data type not understood.") else: if isinstance(X, np.ndarray): return X[:, self.support_] elif hasattr(X, 'loc'): return X.loc[:, self.support_] elif sp.sparse.issparse(X): return X[:, self.support_] else: raise ValueError("Data type not understood.") def transform(self, X): """Reduces the input X to the features selected with RFA. Parameters ---------- X : array-like of shape (n_samples, n_features) Samples. Returns ------- X : array-like of shape (n_samples, n_features_) The input samples with only the features selected with RFA. """ check_is_fitted(self) return self._transform(X, inverse=False) ================================================ FILE: shaphypetune/shaphypetune.py ================================================ from sklearn.base import clone from ._classes import _BoostSearch, _Boruta, _RFA, _RFE class BoostSearch(_BoostSearch): """Hyperparameter searching and optimization on a given validation set for LGBModel or XGBModel. Pass an LGBModel or XGBModel, and a dictionary with the parameter boundaries for grid, random or bayesian search. To operate random search pass distributions in the param_grid with an rvs method for sampling (such as those from scipy.stats.distributions). To operate bayesian search pass hyperopt distributions. The specification of n_iter or sampling_seed is effective only with random or hyperopt searches. The best parameter combination is the one which obtains the best score (as returned by eval_metric) on the provided eval_set. 
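For instance, the search mode follows directly from the shape of param_grid. A minimal sketch (the parameter names are illustrative, and `UniformLike` is a hypothetical dependency-free stand-in for a scipy.stats frozen distribution):

```python
import random

# Grid search: every entry is a plain list, so the full grid is enumerated.
grid_space = {'n_estimators': [100, 200], 'learning_rate': [0.05, 0.1]}

# Random search: at least one entry exposes an rvs() sampling method,
# as scipy.stats frozen distributions do. UniformLike is a hypothetical
# stand-in used only to keep this sketch dependency-free.
class UniformLike:
    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self, random_state=None):
        # draw one sample uniformly from [low, high]
        return random.uniform(self.low, self.high)

random_space = {'n_estimators': [100, 200],
                'learning_rate': UniformLike(0.01, 0.3)}
```

With grid_space all four combinations would be tried; with a distribution-valued entry like the one in random_space, the number of sampled configurations is capped by n_iter.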
If all parameters are presented as lists, floats or integers, grid-search is performed. If at least one parameter is given as a distribution (such as those from scipy.stats.distributions), random-search is performed, sampling with replacement. Bayesian search is effective only when all the parameters to tune are in the form of hyperopt distributions. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- estimator : object A supervised learning estimator of LGBModel or XGBModel type. param_grid : dict Dictionary with parameter names (`str`) as keys and distributions or lists of parameters to try. greater_is_better : bool, default=False Whether the quantity to monitor is a score function, meaning high is good, or a loss function, meaning low is good. n_iter : int, default=None Effective only for random or hyperopt search. Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution. sampling_seed : int, default=None Effective only for random or hyperopt search. The seed used to sample from the hyperparameter distributions. n_jobs : int, default=None Effective only with grid and random search. The number of jobs to run in parallel for model fitting. ``None`` means 1 (one processor). ``-1`` means using all processors. verbose : int, default=1 Verbosity mode. <=0 silent all; >0 print trial logs with the associated score. Attributes ---------- estimator_ : estimator Estimator that was chosen by the search, i.e. estimator which gave the best score on the eval_set. best_params_ : dict Parameter setting that gave the best results on the eval_set. trials_ : list A list of dicts. The dicts are all the parameter combinations tried and derived from the param_grid. best_score_ : float The best score achieved among all the parameter combinations tried. scores_ : list The scores achieved on the eval_set by all the models tried. 
best_iter_ : int The boosting iterations achieved by the best parameter combination. iterations_ : list The boosting iterations of all the models tried. boost_type_ : str The type of the boosting estimator (LGB or XGB). """ def __init__(self, estimator, *, param_grid, greater_is_better=False, n_iter=None, sampling_seed=None, verbose=1, n_jobs=None): self.estimator = estimator self.param_grid = param_grid self.greater_is_better = greater_is_better self.n_iter = n_iter self.sampling_seed = sampling_seed self.verbose = verbose self.n_jobs = n_jobs def _build_model(self, params): """Private method to build model.""" model = clone(self.estimator) model.set_params(**params) return model class BoostBoruta(_BoostSearch, _Boruta): """Simultaneous feature selection with the Boruta algorithm and hyperparameter searching on a given validation set for LGBModel or XGBModel. Pass an LGBModel or XGBModel to compute feature selection with the Boruta algorithm. The best features are used to train a new gradient boosting instance. When an eval_set is provided, shadow features are built on it as well. If param_grid is a dictionary with parameter boundaries, hyperparameter tuning is performed simultaneously. The parameter combinations are scored on the provided eval_set. To operate random search pass distributions in the param_grid with an rvs method for sampling (such as those from scipy.stats.distributions). To operate bayesian search pass hyperopt distributions. The specification of n_iter or sampling_seed is effective only with random or hyperopt searches. The best parameter combination is the one which obtains the best score (as returned by eval_metric) on the provided eval_set. If all parameters are presented as lists, floats or integers, grid-search is performed. If at least one parameter is given as a distribution (such as those from scipy.stats.distributions), random-search is performed, sampling with replacement. 
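The shadow features mentioned above can be built with plain NumPy, as in this simplified sketch of the internal permutation step (the helper name is illustrative):

```python
import numpy as np

def add_shadow_features(X, seed=0):
    # Each shadow column is an independent permutation of a real column:
    # it keeps the column's marginal distribution but breaks any link to y,
    # giving a per-column "noise baseline" for feature importance.
    rng = np.random.RandomState(seed)
    X_sha = np.apply_along_axis(rng.permutation, 0, X.copy())
    return np.hstack((X, X_sha))

X = np.arange(12.0).reshape(4, 3)
X_ext = add_shadow_features(X)  # shape (4, 6): 3 real + 3 shadow columns
```

A real feature is then credited a "hit" only when its importance exceeds the chosen percentile of the shadow importances.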
Bayesian search is effective only when all the parameters to tune are in the form of hyperopt distributions. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- estimator : object A supervised learning estimator of LGBModel or XGBModel type. perc : int, default=100 Threshold for comparison between shadow and real features. The lower perc is, the more false positives will be picked as relevant, but also the fewer relevant features will be left out. 100 corresponds to the max. alpha : float, default=0.05 Level at which the corrected p-values will get rejected in the correction steps. max_iter : int, default=100 The maximum number of Boruta iterations to perform. early_stopping_boruta_rounds : int, default=None The maximum number of iterations without confirming a tentative feature. Use early stopping to terminate the selection process before reaching `max_iter` iterations if the algorithm cannot confirm a tentative feature after N iterations. None means no early stopping search. importance_type : str, default='feature_importances' Which importance measure to use. It can be 'feature_importances' (the default feature importance of the gradient boosting estimator) or 'shap_importances'. train_importance : bool, default=True Effective only when importance_type='shap_importances'. Where to compute the shap feature importance: on train (True) or on eval_set (False). param_grid : dict, default=None Dictionary with parameter names (`str`) as keys and distributions or lists of parameters to try. None means no hyperparameter search. greater_is_better : bool, default=False Effective only when hyperparameters searching. Whether the quantity to monitor is a score function, meaning high is good, or a loss function, meaning low is good. n_iter : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt searches. Number of parameter settings that are sampled. 
n_iter trades off runtime vs quality of the solution. sampling_seed : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt search. The seed used to sample from the hyperparameter distributions. n_jobs : int, default=None Effective only when hyperparameters searching without hyperopt. The number of jobs to run in parallel for model fitting. ``None`` means 1 (one processor). ``-1`` means using all processors. verbose : int, default=1 Verbosity mode. <=0 silent all; ==1 print trial logs (when hyperparameters searching); >1 print feature selection logs plus trial logs (when hyperparameters searching). Attributes ---------- estimator_ : estimator The fitted estimator with the selected features and the optimal parameter combination (when hyperparameters searching). n_features_ : int The number of selected features (from the best param config when hyperparameters searching). ranking_ : ndarray of shape (n_features,) The feature ranking, such that ``ranking_[i]`` corresponds to the ranking position of the i-th feature (from the best param config when hyperparameters searching). Selected features are assigned rank 1 (2: tentative upper bound, 3: tentative lower bound, 4: rejected). support_ : ndarray of shape (n_features,) The mask of selected features (from the best param config when hyperparameters searching). importance_history_ : ndarray of shape (n_iters, n_features) The importance values for each feature across all iterations. best_params_ : dict Available only when hyperparameters searching. Parameter setting that gave the best results on the eval_set. trials_ : list Available only when hyperparameters searching. A list of dicts. The dicts are all the parameter combinations tried and derived from the param_grid. best_score_ : float Available only when hyperparameters searching. The best score achieved among all the parameter combinations tried. scores_ : list Available only when hyperparameters searching. 
The scores achieved on the eval_set by all the models tried. best_iter_ : int Available only when hyperparameters searching. The boosting iterations achieved by the best parameter combination. iterations_ : list Available only when hyperparameters searching. The boosting iterations of all the models tried. boost_type_ : str The type of the boosting estimator (LGB or XGB). Notes ----- The code for the Boruta algorithm is inspired and improved from: https://github.com/scikit-learn-contrib/boruta_py """ def __init__(self, estimator, *, perc=100, alpha=0.05, max_iter=100, early_stopping_boruta_rounds=None, param_grid=None, greater_is_better=False, importance_type='feature_importances', train_importance=True, n_iter=None, sampling_seed=None, verbose=1, n_jobs=None): self.estimator = estimator self.perc = perc self.alpha = alpha self.max_iter = max_iter self.early_stopping_boruta_rounds = early_stopping_boruta_rounds self.param_grid = param_grid self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.n_iter = n_iter self.sampling_seed = sampling_seed self.verbose = verbose self.n_jobs = n_jobs def _build_model(self, params=None): """Private method to build model.""" estimator = clone(self.estimator) if params is not None: estimator.set_params(**params) model = _Boruta( estimator=estimator, perc=self.perc, alpha=self.alpha, max_iter=self.max_iter, early_stopping_boruta_rounds=self.early_stopping_boruta_rounds, importance_type=self.importance_type, train_importance=self.train_importance, verbose=self.verbose ) return model class BoostRFE(_BoostSearch, _RFE): """Simultaneous feature selection with RFE and hyperparameter searching on a given 
validation set for LGBModel or XGBModel. Pass an LGBModel or XGBModel to compute feature selection with RFE. The gradient boosting instance with the best features is selected. When an eval_set is provided, the best gradient boosting model and the best features are obtained by evaluating the score with eval_metric. Otherwise, the best combination is obtained looking only at feature importance. If param_grid is a dictionary with parameter boundaries, hyperparameter tuning is performed simultaneously. The parameter combinations are scored on the provided eval_set. To operate random search pass distributions in the param_grid with an rvs method for sampling (such as those from scipy.stats.distributions). To operate bayesian search pass hyperopt distributions. The specification of n_iter or sampling_seed is effective only with random or hyperopt searches. The best parameter combination is the one which obtains the best score (as returned by eval_metric) on the provided eval_set. If all parameters are presented as lists, floats or integers, grid-search is performed. If at least one parameter is given as a distribution (such as those from scipy.stats.distributions), random-search is performed, sampling with replacement. Bayesian search is effective only when all the parameters to tune are in the form of hyperopt distributions. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- estimator : object A supervised learning estimator of LGBModel or XGBModel type. step : int or float, default=1 If greater than or equal to 1, then `step` corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then `step` corresponds to the percentage (rounded down) of features to remove at each iteration. Note that the last iteration may remove fewer than `step` features in order to reach `min_features_to_select`. min_features_to_select : int, default=None The minimum number of features to be selected. 
This number of features will always be scored, even if the difference between the original feature count and `min_features_to_select` isn't divisible by `step`. The default value for min_features_to_select is set to 1 when a eval_set is provided, otherwise it always corresponds to n_features // 2. importance_type : str, default='feature_importances' Which importance measure to use. It can be 'feature_importances' (the default feature importance of the gradient boosting estimator) or 'shap_importances'. train_importance : bool, default=True Effective only when importance_type='shap_importances'. Where to compute the shap feature importance: on train (True) or on eval_set (False). param_grid : dict, default=None Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try. None means no hyperparameters search. greater_is_better : bool, default=False Effective only when hyperparameters searching. Whether the quantity to monitor is a score function, meaning high is good, or a loss function, meaning low is good. n_iter : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt serach. Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution. sampling_seed : int, default=None Effective only when hyperparameters searching. Effective only for random or hyperopt serach. The seed used to sample from the hyperparameter distributions. n_jobs : int, default=None Effective only when hyperparameters searching without hyperopt. The number of jobs to run in parallel for model fitting. ``None`` means 1 using one processor. ``-1`` means using all processors. verbose : int, default=1 Verbosity mode. <=0 silent all; ==1 print trial logs (when hyperparameters searching); >1 print feature selection logs plus trial logs (when hyperparameters searching). 
Attributes ---------- estimator_ : estimator The fitted estimator with the select features and the optimal parameter combination (when hyperparameters searching). n_features_ : int The number of selected features (from the best param config when hyperparameters searching). ranking_ : ndarray of shape (n_features,) The feature ranking, such that ``ranking_[i]`` corresponds to the ranking position of the i-th feature (from the best param config when hyperparameters searching). Selected features are assigned rank 1. support_ : ndarray of shape (n_features,) The mask of selected features (from the best param config when hyperparameters searching). score_history_ : list Available only when a eval_set is provided. Scores obtained reducing the features (from the best param config when hyperparameters searching). best_params_ : dict Available only when hyperparameters searching. Parameter setting that gave the best results on the eval_set. trials_ : list Available only when hyperparameters searching. A list of dicts. The dicts are all the parameter combinations tried and derived from the param_grid. best_score_ : float Available only when hyperparameters searching. The best score achieved by all the possible combination created. scores_ : list Available only when hyperparameters searching. The scores achieved on the eval_set by all the models tried. best_iter_ : int Available only when hyperparameters searching. The boosting iterations achieved by the best parameters combination. iterations_ : list Available only when hyperparameters searching. The boosting iterations of all the models tried. boost_type_ : str The type of the boosting estimator (LGB or XGB). 
""" def __init__(self, estimator, *, min_features_to_select=None, step=1, param_grid=None, greater_is_better=False, importance_type='feature_importances', train_importance=True, n_iter=None, sampling_seed=None, verbose=1, n_jobs=None): self.estimator = estimator self.min_features_to_select = min_features_to_select self.step = step self.param_grid = param_grid self.greater_is_better = greater_is_better self.importance_type = importance_type self.train_importance = train_importance self.n_iter = n_iter self.sampling_seed = sampling_seed self.verbose = verbose self.n_jobs = n_jobs def _build_model(self, params=None): """Private method to build model.""" estimator = clone(self.estimator) if params is None: model = _RFE( estimator=estimator, min_features_to_select=self.min_features_to_select, step=self.step, greater_is_better=self.greater_is_better, importance_type=self.importance_type, train_importance=self.train_importance, verbose=self.verbose ) else: estimator.set_params(**params) model = _RFE( estimator=estimator, min_features_to_select=self.min_features_to_select, step=self.step, greater_is_better=self.greater_is_better, importance_type=self.importance_type, train_importance=self.train_importance, verbose=self.verbose ) return model class BoostRFA(_BoostSearch, _RFA): """Simultaneous features selection with RFA and hyperparamater searching on a given validation set for LGBModel or XGBModel. Pass a LGBModel or XGBModel to compute features selection with RFA. The gradient boosting instance with the best features is selected. When a eval_set is provided, the best gradient boosting and the best features are obtained evaluating the score with eval_metric. Otherwise, the best combination is obtained looking only at feature importance. If param_grid is a dictionary with parameter boundaries, a hyperparameter tuning is computed simultaneously. The parameter combinations are scored on the provided eval_set. 
    To operate random search pass distributions in the param_grid with rvs
    method for sampling (such as those from scipy.stats.distributions).
    To operate bayesian search pass hyperopt distributions.
    The specification of n_iter or sampling_seed is effective only with
    random or hyperopt searches.
    The best parameter combination is the one that obtains the best score
    (as returned by eval_metric) on the provided eval_set.

    If all parameters are presented as lists/floats/integers, grid-search
    is performed. If at least one parameter is given as a distribution
    (such as those from scipy.stats.distributions), random-search is
    performed, computing sampling with replacement. Bayesian search is
    effective only when all the parameters to tune are given as hyperopt
    distributions.
    It is highly recommended to use continuous distributions for continuous
    parameters.

    Parameters
    ----------
    estimator : object
        A supervised learning estimator of LGBModel or XGBModel type.

    step : int or float, default=1
        If greater than or equal to 1, then `step` corresponds to the
        (integer) number of features to remove at each iteration.
        If within (0.0, 1.0), then `step` corresponds to the percentage
        (rounded down) of features to remove at each iteration.
        Note that the last iteration may remove fewer than `step` features
        in order to reach `min_features_to_select`.

    min_features_to_select : int, default=None
        The minimum number of features to be selected.
        This number of features will always be scored, even if the
        difference between the original feature count and
        `min_features_to_select` isn't divisible by `step`.
        The default value for min_features_to_select is set to 1 when an
        eval_set is provided, otherwise it always corresponds to
        n_features // 2.

    importance_type : str, default='feature_importances'
        Which importance measure to use. It can be 'feature_importances'
        (the default feature importance of the gradient boosting estimator)
        or 'shap_importances'.

    train_importance : bool, default=True
        Effective only when importance_type='shap_importances'.
        Where to compute the shap feature importance: on train (True)
        or on eval_set (False).

    param_grid : dict, default=None
        Dictionary with parameters names (`str`) as keys and distributions
        or lists of parameters to try.
        None means no hyperparameters search.

    greater_is_better : bool, default=False
        Effective only when hyperparameters searching.
        Whether the quantity to monitor is a score function, meaning high
        is good, or a loss function, meaning low is good.

    n_iter : int, default=None
        Effective only when hyperparameters searching.
        Effective only for random or hyperopt search.
        Number of parameter settings that are sampled.
        n_iter trades off runtime vs quality of the solution.

    sampling_seed : int, default=None
        Effective only when hyperparameters searching.
        Effective only for random or hyperopt search.
        The seed used to sample from the hyperparameter distributions.

    n_jobs : int, default=None
        Effective only when hyperparameters searching without hyperopt.
        The number of jobs to run in parallel for model fitting.
        ``None`` means 1 using one processor. ``-1`` means using all
        processors.

    verbose : int, default=1
        Verbosity mode. <=0 silent all; ==1 print trial logs (when
        hyperparameters searching); >1 print feature selection logs plus
        trial logs (when hyperparameters searching).

    Attributes
    ----------
    estimator_ : estimator
        The fitted estimator with the selected features and the optimal
        parameter combination (when hyperparameters searching).

    n_features_ : int
        The number of selected features (from the best param config
        when hyperparameters searching).

    ranking_ : ndarray of shape (n_features,)
        The feature ranking, such that ``ranking_[i]`` corresponds to the
        ranking position of the i-th feature (from the best param config
        when hyperparameters searching).
        Selected features are assigned rank 1.

    support_ : ndarray of shape (n_features,)
        The mask of selected features (from the best param config when
        hyperparameters searching).

    score_history_ : list
        Available only when an eval_set is provided.
        Scores obtained reducing the features (from the best param config
        when hyperparameters searching).

    best_params_ : dict
        Available only when hyperparameters searching.
        Parameter setting that gave the best results on the eval_set.

    trials_ : list
        Available only when hyperparameters searching.
        A list of dicts. The dicts are all the parameter combinations tried
        and derived from the param_grid.

    best_score_ : float
        Available only when hyperparameters searching.
        The best score achieved by all the possible combinations created.

    scores_ : list
        Available only when hyperparameters searching.
        The scores achieved on the eval_set by all the models tried.

    best_iter_ : int
        Available only when hyperparameters searching.
        The boosting iterations achieved by the best parameters combination.

    iterations_ : list
        Available only when hyperparameters searching.
        The boosting iterations of all the models tried.

    boost_type_ : str
        The type of the boosting estimator (LGB or XGB).
    Notes
    -----
    The code for the RFA algorithm is inspired and improved from:
    https://github.com/heberleh/recursive-feature-addition
    """

    def __init__(self,
                 estimator, *,
                 min_features_to_select=None,
                 step=1,
                 param_grid=None,
                 greater_is_better=False,
                 importance_type='feature_importances',
                 train_importance=True,
                 n_iter=None,
                 sampling_seed=None,
                 verbose=1,
                 n_jobs=None):
        self.estimator = estimator
        self.min_features_to_select = min_features_to_select
        self.step = step
        self.param_grid = param_grid
        self.greater_is_better = greater_is_better
        self.importance_type = importance_type
        self.train_importance = train_importance
        self.n_iter = n_iter
        self.sampling_seed = sampling_seed
        self.verbose = verbose
        self.n_jobs = n_jobs

    def _build_model(self, params=None):
        """Private method to build model."""
        # clone the estimator and apply the trial parameters (if any)
        # before wrapping it in the RFA selector
        estimator = clone(self.estimator)

        if params is not None:
            estimator.set_params(**params)

        model = _RFA(
            estimator=estimator,
            min_features_to_select=self.min_features_to_select,
            step=self.step,
            greater_is_better=self.greater_is_better,
            importance_type=self.importance_type,
            train_importance=self.train_importance,
            verbose=self.verbose
        )

        return model


================================================
FILE: shaphypetune/utils.py
================================================
import random

import numpy as np
from itertools import product

from shap import TreeExplainer


def _check_boosting(model):
    """Check if the estimator is a LGBModel or XGBModel.

    Returns
    -------
    Model type in string format.
""" estimator_type = str(type(model)).lower() boost_type = ('LGB' if 'lightgbm' in estimator_type else '') + \ ('XGB' if 'xgboost' in estimator_type else '') if len(boost_type) != 3: raise ValueError("Pass a LGBModel or XGBModel.") return boost_type def _shap_importances(model, X): """Extract feature importances from fitted boosting models using TreeExplainer from shap. Returns ------- array of feature importances. """ explainer = TreeExplainer( model, feature_perturbation="tree_path_dependent") coefs = explainer.shap_values(X) if isinstance(coefs, list): coefs = list(map(lambda x: np.abs(x).mean(0), coefs)) coefs = np.sum(coefs, axis=0) else: coefs = np.abs(coefs).mean(0) return coefs def _feature_importances(model): """Extract feature importances from fitted boosting models. Returns ------- array of feature importances. """ if hasattr(model, 'coef_'): ## booster='gblinear' (xgb) coefs = np.square(model.coef_).sum(axis=0) else: coefs = model.feature_importances_ return coefs def _get_categorical_support(n_features, fit_params): """Obtain boolean mask for categorical features""" cat_support = np.zeros(n_features, dtype=bool) cat_ids = [] msg = "When manually setting categarical features, " \ "pass a 1D array-like of categorical columns indices " \ "(specified as integers)." 
if 'categorical_feature' in fit_params: # LGB cat_ids = fit_params['categorical_feature'] if len(np.shape(cat_ids)) != 1: raise ValueError(msg) if not all([isinstance(c, int) for c in cat_ids]): raise ValueError(msg) cat_support[cat_ids] = True return cat_support def _set_categorical_indexes(support, cat_support, _fit_params, duplicate=False): """Map categorical features in each data repartition""" if cat_support.any(): n_features = support.sum() support_id = np.zeros_like(support, dtype='int32') support_id[support] = np.arange(n_features, dtype='int32') cat_feat = support_id[np.where(support & cat_support)[0]] # empty if support and cat_support are not alligned if duplicate: # is Boruta cat_feat = cat_feat.tolist() + (n_features + cat_feat).tolist() else: cat_feat = cat_feat.tolist() _fit_params['categorical_feature'] = cat_feat return _fit_params def _check_param(values): """Check the parameter boundaries passed in dict values. Returns ------- list of checked parameters. """ if isinstance(values, (list, tuple, np.ndarray)): return list(set(values)) elif 'scipy' in str(type(values)).lower(): return values elif 'hyperopt' in str(type(values)).lower(): return values else: return [values] class ParameterSampler(object): """Generator on parameters sampled from given distributions. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a scipy distribution, sampling with replacement is used. If all parameters are given as hyperopt distributions Tree of Parzen Estimators searching from hyperopt is computed. It is highly recommended to use continuous distributions for continuous parameters. Parameters ---------- param_distributions : dict Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try. Distributions must provide a ``rvs`` method for random sampling (such as those from scipy.stats.distributions) or be hyperopt distributions for bayesian searching. 
If a list is given, it is sampled uniformly. n_iter : integer, default=None Number of parameter configurations that are produced. random_state : int, default=None Pass an int for reproducible output across multiple function calls. Returns ------- param_combi : list of dicts or dict of hyperopt distributions Parameter combinations. searching_type : str The searching algorithm used. """ def __init__(self, param_distributions, n_iter=None, random_state=None): self.n_iter = n_iter self.random_state = random_state self.param_distributions = param_distributions def sample(self): """Generator parameter combinations from given distributions.""" param_distributions = self.param_distributions.copy() is_grid = all(isinstance(p, list) for p in param_distributions.values()) is_random = all(isinstance(p, list) or 'scipy' in str(type(p)).lower() for p in param_distributions.values()) is_hyperopt = all('hyperopt' in str(type(p)).lower() or (len(p) < 2 if isinstance(p, list) else False) for p in param_distributions.values()) if is_grid: param_combi = list(product(*param_distributions.values())) param_combi = [ dict(zip(param_distributions.keys(), combi)) for combi in param_combi ] return param_combi, 'grid' elif is_random: if self.n_iter is None: raise ValueError( "n_iter must be an integer >0 when scipy parameter " "distributions are provided. Get None." 
) seed = (random.randint(1, 100) if self.random_state is None else self.random_state + 1) random.seed(seed) param_combi = [] k = self.n_iter for i in range(self.n_iter): dist = param_distributions.copy() combi = [] for j, v in enumerate(dist.values()): if 'scipy' in str(type(v)).lower(): combi.append(v.rvs(random_state=seed * (k + j))) else: combi.append(v[random.randint(0, len(v) - 1)]) k += i + j param_combi.append( dict(zip(param_distributions.keys(), combi)) ) np.random.mtrand._rand return param_combi, 'random' elif is_hyperopt: if self.n_iter is None: raise ValueError( "n_iter must be an integer >0 when hyperopt " "search spaces are provided. Get None." ) param_distributions = { k: p[0] if isinstance(p, list) else p for k, p in param_distributions.items() } return param_distributions, 'hyperopt' else: raise ValueError( "Parameters not recognized. " "Pass lists, scipy distributions (also in conjunction " "with lists), or hyperopt search spaces." )
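A quick illustration of the grid branch of `ParameterSampler.sample`: when every entry of the param dict is a plain list, the combinations are simply the Cartesian product of the lists. The sketch below reproduces that logic standalone with `itertools.product` (the parameter names are hypothetical, and it does not import shaphypetune itself):

```python
from itertools import product

# hypothetical search space: every value is a list -> grid search
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
}

# same construction as the is_grid branch above:
# one dict per element of the Cartesian product
param_combi = [
    dict(zip(param_grid.keys(), combi))
    for combi in product(*param_grid.values())
]

print(len(param_combi))   # 2 * 3 = 6 combinations
print(param_combi[0])     # {'n_estimators': 100, 'max_depth': 3}
```

Passing any non-list value (e.g. a scipy distribution) would instead route `sample` to the random branch, which is why `n_iter` is mandatory only there.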