Repository: EBjerrum/scikit-mol
Branch: main
Commit: 119cb8c00bb4
Files: 128
Total size: 3.7 MB
Directory structure:
gitextract_o126_axo/
├── .github/
│ ├── CODEOWNERS
│ └── workflows/
│ ├── code_quality.yaml
│ ├── publish.yaml
│ ├── pytest.yaml
│ └── welcome.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .readthedocs.yaml
├── .vscode/
│ ├── extensions.json
│ └── settings.json
├── CITATION.bib
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.md
├── docs/
│ ├── api/
│ │ ├── fingerprints.base.md
│ │ ├── scikit_mol.applicability.md
│ │ ├── scikit_mol.conversions.md
│ │ ├── scikit_mol.core.md
│ │ ├── scikit_mol.descriptors.md
│ │ ├── scikit_mol.fingerprints.md
│ │ ├── scikit_mol.parallel.md
│ │ ├── scikit_mol.plotting.md
│ │ ├── scikit_mol.safeinference.md
│ │ └── scikit_mol.standardizer.md
│ ├── assets/
│ │ ├── css/
│ │ │ └── tweak-width.css
│ │ └── js/
│ │ └── readthedocs.js
│ ├── contributing.md
│ ├── index.md
│ ├── notebooks/
│ │ ├── 01_basic_usage.ipynb
│ │ ├── 02_descriptor_transformer.ipynb
│ │ ├── 03_example_pipeline.ipynb
│ │ ├── 04_standardizer.ipynb
│ │ ├── 05_smiles_sanitization.ipynb
│ │ ├── 06_hyperparameter_tuning.ipynb
│ │ ├── 07_parallel_transforms.ipynb
│ │ ├── 08_external_library_skopt.ipynb
│ │ ├── 09_Combinatorial_Method_Usage_with_FingerPrint_Transformers.ipynb
│ │ ├── 10_pipeline_pandas_output.ipynb
│ │ ├── 11_safe_inference.ipynb
│ │ ├── 12_custom_fingerprint_transformer.ipynb
│ │ ├── 13_applicability_domain.ipynb
│ │ ├── README.md
│ │ ├── pair_notebook.sh
│ │ ├── run_notebooks.sh
│ │ ├── scripts/
│ │ │ ├── 01_basic_usage.py
│ │ │ ├── 02_descriptor_transformer.py
│ │ │ ├── 03_example_pipeline.py
│ │ │ ├── 04_standardizer.py
│ │ │ ├── 05_smiles_sanitization.py
│ │ │ ├── 06_hyperparameter_tuning.py
│ │ │ ├── 07_parallel_transforms.py
│ │ │ ├── 08_external_library_skopt.py
│ │ │ ├── 09_Combinatorial_Method_Usage_with_FingerPrint_Transformers.py
│ │ │ ├── 10_pipeline_pandas_output.py
│ │ │ ├── 11_safe_inference.py
│ │ │ ├── 12_custom_fingerprint_transformer.py
│ │ │ └── 13_applicability_domain.py
│ │ └── sync_notebooks.sh
│ └── overrides/
│ └── main.html
├── mkdocs.yml
├── pyproject.toml
├── resources/
│ └── logo/
│ ├── ScikitMol_Logo.ai
│ └── ScikitMol_Logo_Hybrid.ai
├── ruff.toml
├── scikit_mol/
│ ├── __init__.py
│ ├── _constants.py
│ ├── applicability/
│ │ ├── LICENSE.MIT
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── base.py
│ │ ├── bounding_box.py
│ │ ├── convex_hull.py
│ │ ├── hotelling.py
│ │ ├── isolation_forest.py
│ │ ├── kernel_density.py
│ │ ├── knn.py
│ │ ├── leverage.py
│ │ ├── local_outlier.py
│ │ ├── mahalanobis.py
│ │ ├── standardization.py
│ │ └── topkat.py
│ ├── conversions.py
│ ├── core.py
│ ├── descriptors.py
│ ├── fingerprints/
│ │ ├── __init__.py
│ │ ├── atompair.py
│ │ ├── avalon.py
│ │ ├── baseclasses.py
│ │ ├── maccs.py
│ │ ├── minhash.py
│ │ ├── morgan.py
│ │ ├── rdkitfp.py
│ │ └── topologicaltorsion.py
│ ├── parallel.py
│ ├── plotting.py
│ ├── safeinference.py
│ ├── standardizer.py
│ └── utilities.py
├── setup.cfg
├── tests/
│ ├── __init__.py
│ ├── applicability/
│ │ ├── __init__.py
│ │ ├── conftest.py
│ │ ├── test_base.py
│ │ ├── test_bounding_box.py
│ │ ├── test_convex_hull.py
│ │ ├── test_hotelling.py
│ │ ├── test_isolation_forest.py
│ │ ├── test_kernel_density.py
│ │ ├── test_knn.py
│ │ ├── test_leverage.py
│ │ ├── test_local_outlier.py
│ │ ├── test_mahalanobis.py
│ │ ├── test_standardization.py
│ │ └── test_topkat.py
│ ├── conftest.py
│ ├── fixtures.py
│ ├── test_desctransformer.py
│ ├── test_fptransformers.py
│ ├── test_fptransformersgenerator.py
│ ├── test_parameter_types.py
│ ├── test_safeinferencemode.py
│ ├── test_sanitizer.py
│ ├── test_scikit_mol.py
│ ├── test_smilestomol.py
│ └── test_transformers.py
└── uv.toml
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/CODEOWNERS
================================================
* @EBjerrum
scikit_mol/parallel.py @asiomchen
scikit_mol/plotting.py @asiomchen
================================================
FILE: .github/workflows/code_quality.yaml
================================================
name: Code Quality Checks
on: [ push, pull_request ]
jobs:
ruff-checks:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Check code formatting
uses: astral-sh/ruff-action@v3
with:
version: 0.8.6
args: "format --check"
src: "./scikit_mol"
- name: Check code style
uses: astral-sh/ruff-action@v3
with:
version: 0.8.6
args: "check"
src: "./scikit_mol"
================================================
FILE: .github/workflows/publish.yaml
================================================
name: Publish Python🐍 distribution📦 with uv🌈
# After releasing a new version, build the distribution and upload signed artifacts to the GitHub Release
on:
workflow_call:
inputs:
python-version:
type: string
description: "Python version to use"
required: true
default: "3.9"
is-draft:
type: boolean
description: "Is this a draft release?"
required: false
default: false
dist-artifact-name:
type: string
description: "Name of the created distribution artifact"
required: false
default: "python-package-distributions"
jobs:
build-and-publish:
name: Build distribution📦
runs-on: ubuntu-latest
permissions:
contents: write #
id-token: write # IMPORTANT: mandatory for sigstore
steps:
- uses: actions/checkout@v4
with:
persist-credentials: false
- name: Set up Python🐍
uses: astral-sh/setup-uv@v5
with:
python-version: ${{ inputs.python-version }}
- name: Build a binary wheel and a source tarball
run: uv build
- name: Store the distribution packages
uses: actions/upload-artifact@v4
with:
name: ${{ inputs.dist-artifact-name }}
path: dist/
github-release:
name: >-
Sign the distribution📦 with Sigstore
and upload them to GitHub Release
needs:
- build-and-publish
runs-on: ubuntu-latest
permissions:
contents: write # IMPORTANT: mandatory for making GitHub Releases
id-token: write # IMPORTANT: mandatory for sigstore
steps:
- name: Download all the dists
uses: actions/download-artifact@v4
with:
name: ${{ inputs.dist-artifact-name }}
path: dist/
- name: Sign the dists with Sigstore🔏
uses: sigstore/gh-action-sigstore-python@v3.0.0
with:
inputs: >-
./dist/*.tar.gz
./dist/*.whl
- name: Create GitHub Release
env:
GITHUB_TOKEN: ${{ github.token }}
run: >-
gh release create
"$GITHUB_REF_NAME"
--repo "$GITHUB_REPOSITORY"
--generate-notes ${{ inputs.is-draft && '--draft' || '' }}
- name: Upload artifact signatures to GitHub Release
env:
GITHUB_TOKEN: ${{ github.token }}
run: >-
gh release upload
"$GITHUB_REF_NAME" dist/**
--repo "$GITHUB_REPOSITORY"
================================================
FILE: .github/workflows/pytest.yaml
================================================
name: scikit_mol ci
on:
push:
branches: [main]
tags: ['v*']
pull_request:
branches: [main]
# cancel previously running tests if new commits are made
# https://docs.github.com/en/actions/examples/using-concurrency-expressions-and-a-test-matrix
concurrency:
group: actions-id-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
# run pytests for scikit_mol
tests:
name: pytest ${{ matrix.os }}::py${{ matrix.python-version }}
runs-on: ${{ matrix.os }}
strategy:
max-parallel: 6
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
python-version: ["3.10"]
include:
# test python version compatibility on linux only
- os: ubuntu-latest
python-version: "3.13"
- os: ubuntu-latest
python-version: "3.12"
- os: ubuntu-latest
python-version: "3.11"
- os: ubuntu-latest
python-version: "3.10"
- os: ubuntu-latest
python-version: "3.9"
steps:
- name: Checkout scikit_mol
uses: actions/checkout@v4
- name: Install uv and set the python version
uses: astral-sh/setup-uv@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install scikit_mol
run: uv sync --dev
- name: Cache tests/data
uses: actions/cache@v4
with:
path: tests/data
key: ${{ runner.os }}-${{ hashFiles('tests/conftest.py') }}
- name: Run Tests
run: uv run pytest --cov=./scikit_mol .
build-and-create-signed-release:
name: Build distribution📦 & create Github Release
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
needs: tests
uses: ./.github/workflows/publish.yaml
with:
python-version: "3.9"
is-draft: true
publish:
name: Publish to PyPI
needs: build-and-create-signed-release
runs-on: ubuntu-latest
# will be enabled in the future
# environment:
# name: pypi
# url: https://pypi.org/p/scikit-mol
permissions:
id-token: write # IMPORTANT: mandatory for trusted publishing
steps:
- name: Download all the dists
uses: actions/download-artifact@v4
with:
name: python-package-distributions
path: dist/
- name: Publish to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
================================================
FILE: .github/workflows/welcome.yaml
================================================
name: Welcome WorkFlow
on:
issues:
types: [opened]
pull_request_target:
types: [opened]
jobs:
build:
name: 👋 Welcome
permissions: write-all
runs-on: ubuntu-latest
steps:
- uses: actions/first-interaction@v1.3.0
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
issue-message: "🎉 Welcome to scikit-mol! 🧪✨ Thank you for opening your first issue! 🚀 Your feedback helps improve the project and makes a difference. 💡 If you have any questions or need guidance, don't hesitate to ask. We're here to help! 🤝"
pr-message: "🎉 Welcome to scikit-mol! 🧪✨ Thank you for submitting your first pull request! 🔧 Your effort and contributions mean a lot to us. 🙌 We'll review it as soon as possible. 🚀"
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
.scikit-mol.code-workspace.swp
scikit-mol.code-workspace
# test data
tests/data/
# setuptools_scm version
scikit_mol/_version.py
notebooks/SLC6A4_active_excape_export.csv
sandbox/
# PyCharm settings
.idea
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: requirements-txt-fixer
- id: mixed-line-ending
- id: check-yaml
- id: check-json
- id: pretty-format-json
args: ['--autofix']
exclude: .ipynb
- id: check-added-large-files
- id: check-merge-conflict
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.8.6
hooks:
# Run the linter.
- id: ruff
args: [ --fix ]
types_or: [ python, pyi ]
# Run the formatter.
- id: ruff-format
================================================
FILE: .readthedocs.yaml
================================================
# Read the Docs configuration file for MkDocs projects
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"
jobs:
# Use uv to speed up the build
# https://docs.readthedocs.io/en/stable/build-customization.html#install-dependencies-with-uv
pre_create_environment:
- asdf plugin add uv
- asdf install uv 0.5.24
- asdf global uv 0.5.24
create_environment:
- uv venv
install:
- uv sync --group docs
build:
html:
- uv run mkdocs build -d $READTHEDOCS_OUTPUT/html
mkdocs:
configuration: mkdocs.yml
================================================
FILE: .vscode/extensions.json
================================================
{
"recommendations": [
"njpwerner.autodocstring"
]
}
================================================
FILE: .vscode/settings.json
================================================
{
"autoDocstring.docstringFormat": "numpy"
}
================================================
FILE: CITATION.bib
================================================
@article{bjerrum_scikit-mol_2023,
title = {Scikit-{Mol} brings cheminformatics to {Scikit}-{Learn}},
author = {Bjerrum, Esben Jannik and Bachorz, Rafał Adam and Bitton, Adrien and Choung, Oh-hyeon and Chen, Ya and Esposito, Carmen and Ha, Son Viet and Poehlmann, Andreas},
year = {2023},
month = dec,
journal = {ChemRxiv},
url = {https://chemrxiv.org/engage/chemrxiv/article-details/60ef0fc58825826143a82cc0},
doi = {10.26434/chemrxiv-2023-fzqwd},
abstract = {Scikit-Mol is a open-source toolkit that aims to bridge the gap between two well-established toolkits, RDKit and Scikit-Learn, in order to provide a simple interface for building cheminformatics models. By leveraging the strengths of both RDKit and Scikit-Learn, Scikit-Mol provides a powerful platform for creating predictive modeling in drug discovery and materials design. Unlike other toolkits that often integrate both chemistry and machine learning, Scikit-Mol rather aims to be a simple bridge between the two, reducing the maintenance effort required to keep up with changes and new features in e.g. Scikit-Learn. A simple example of Scikit-Mol's functionality is provided, demonstrating its compatibility with Scikit-Learn pipelines. Overall, Scikit-Mol provides a useful and flexible package for building self-contained and self-documented cheminformatics models with minimal maintenance required.},
language = {en},
urldate = {2023-12-06},
keywords = {Cheminformatics, Descriptors, Fingerprints, Machine Learning, RDKit, Scikit-Learn},
note = {preprint}
}
================================================
FILE: CONTRIBUTING.md
================================================
# Contribution
For up-to-date information, see
[docs/contributing.md](docs/contributing.md)
or
[https://scikit-mol.readthedocs.io/en/latest/contributing/](https://scikit-mol.readthedocs.io/en/latest/contributing/)
================================================
FILE: LICENSE
================================================
GNU LESSER GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
This version of the GNU Lesser General Public License incorporates
the terms and conditions of version 3 of the GNU General Public
License, supplemented by the additional permissions listed below.
0. Additional Definitions.
As used herein, "this License" refers to version 3 of the GNU Lesser
General Public License, and the "GNU GPL" refers to version 3 of the GNU
General Public License.
"The Library" refers to a covered work governed by this License,
other than an Application or a Combined Work as defined below.
An "Application" is any work that makes use of an interface provided
by the Library, but which is not otherwise based on the Library.
Defining a subclass of a class defined by the Library is deemed a mode
of using an interface provided by the Library.
A "Combined Work" is a work produced by combining or linking an
Application with the Library. The particular version of the Library
with which the Combined Work was made is also called the "Linked
Version".
The "Minimal Corresponding Source" for a Combined Work means the
Corresponding Source for the Combined Work, excluding any source code
for portions of the Combined Work that, considered in isolation, are
based on the Application, and not on the Linked Version.
The "Corresponding Application Code" for a Combined Work means the
object code and/or source code for the Application, including any data
and utility programs needed for reproducing the Combined Work from the
Application, but excluding the System Libraries of the Combined Work.
1. Exception to Section 3 of the GNU GPL.
You may convey a covered work under sections 3 and 4 of this License
without being bound by section 3 of the GNU GPL.
2. Conveying Modified Versions.
If you modify a copy of the Library, and, in your modifications, a
facility refers to a function or data to be supplied by an Application
that uses the facility (other than as an argument passed when the
facility is invoked), then you may convey a copy of the modified
version:
a) under this License, provided that you make a good faith effort to
ensure that, in the event an Application does not supply the
function or data, the facility still operates, and performs
whatever part of its purpose remains meaningful, or
b) under the GNU GPL, with none of the additional permissions of
this License applicable to that copy.
3. Object Code Incorporating Material from Library Header Files.
The object code form of an Application may incorporate material from
a header file that is part of the Library. You may convey such object
code under terms of your choice, provided that, if the incorporated
material is not limited to numerical parameters, data structure
layouts and accessors, or small macros, inline functions and templates
(ten or fewer lines in length), you do both of the following:
a) Give prominent notice with each copy of the object code that the
Library is used in it and that the Library and its use are
covered by this License.
b) Accompany the object code with a copy of the GNU GPL and this license
document.
4. Combined Works.
You may convey a Combined Work under terms of your choice that,
taken together, effectively do not restrict modification of the
portions of the Library contained in the Combined Work and reverse
engineering for debugging such modifications, if you also do each of
the following:
a) Give prominent notice with each copy of the Combined Work that
the Library is used in it and that the Library and its use are
covered by this License.
b) Accompany the Combined Work with a copy of the GNU GPL and this license
document.
c) For a Combined Work that displays copyright notices during
execution, include the copyright notice for the Library among
these notices, as well as a reference directing the user to the
copies of the GNU GPL and this license document.
d) Do one of the following:
0) Convey the Minimal Corresponding Source under the terms of this
License, and the Corresponding Application Code in a form
suitable for, and under terms that permit, the user to
recombine or relink the Application with a modified version of
the Linked Version to produce a modified Combined Work, in the
manner specified by section 6 of the GNU GPL for conveying
Corresponding Source.
1) Use a suitable shared library mechanism for linking with the
Library. A suitable mechanism is one that (a) uses at run time
a copy of the Library already present on the user's computer
system, and (b) will operate properly with a modified version
of the Library that is interface-compatible with the Linked
Version.
e) Provide Installation Information, but only if you would otherwise
be required to provide such information under section 6 of the
GNU GPL, and only to the extent that such information is
necessary to install and execute a modified version of the
Combined Work produced by recombining or relinking the
Application with a modified version of the Linked Version. (If
you use option 4d0, the Installation Information must accompany
the Minimal Corresponding Source and Corresponding Application
Code. If you use option 4d1, you must provide the Installation
Information in the manner specified by section 6 of the GNU GPL
for conveying Corresponding Source.)
5. Combined Libraries.
You may place library facilities that are a work based on the
Library side by side in a single library together with other library
facilities that are not Applications and are not covered by this
License, and convey such a combined library under terms of your
choice, if you do both of the following:
a) Accompany the combined library with a copy of the same work based
on the Library, uncombined with any other library facilities,
conveyed under the terms of this License.
b) Give prominent notice with the combined library that part of it
is a work based on the Library, and explaining where to find the
accompanying uncombined form of the same work.
6. Revised Versions of the GNU Lesser General Public License.
The Free Software Foundation may publish revised and/or new versions
of the GNU Lesser General Public License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the
Library as you received it specifies that a certain numbered version
of the GNU Lesser General Public License "or any later version"
applies to it, you have the option of following the terms and
conditions either of that published version or of any later version
published by the Free Software Foundation. If the Library as you
received it does not specify a version number of the GNU Lesser
General Public License, you may choose any version of the GNU Lesser
General Public License ever published by the Free Software Foundation.
If the Library as you received it specifies that a proxy can decide
whether future versions of the GNU Lesser General Public License shall
apply, that proxy's public statement of acceptance of any version is
permanent authorization for you to choose that version for the
Library.
================================================
FILE: MANIFEST.in
================================================
prune .github
exclude .git*
================================================
FILE: Makefile
================================================
sync-notebooks:
uv run jupytext --set-formats docs//notebooks//ipynb,docs//notebooks//scripts//py:percent --sync docs/notebooks/*.ipynb
uv run ruff format "docs/notebooks/"
run-notebooks:
uv run jupytext --execute docs/notebooks/*ipynb
================================================
FILE: README.md
================================================
# scikit-mol

## Scikit-Learn classes for molecular vectorization using RDKit
The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.
As an example, with the needed scikit-learn and scikit-mol imports in place and RDKit mol objects in the mol_list_train and mol_list_test lists:

    pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
    pipe.fit(mol_list_train, y_train)
    pipe.score(mol_list_test, y_test)
    pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
    >>> array([4.93858815])

The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
The first draft of the project was created at the [RDKit UGM 2022 hackathon](https://github.com/rdkit/UGM_2022) on 14 October 2022.
## Installation
Users can install the latest tagged release with pip
```sh
pip install scikit-mol
```
or from conda-forge
```sh
conda install -c conda-forge scikit-mol
```
The conda-forge package should be updated shortly after each new tagged release on PyPI.
To get the bleeding edge, install directly from GitHub
```sh
pip install git+https://github.com/EBjerrum/scikit-mol.git
```
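To confirm which release ended up in the active environment, a small standard-library check can help. This is only a sketch: the helper name `installed_version` is hypothetical, and the distribution name is assumed to be `scikit-mol`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed
print(installed_version("scikit-mol"))
```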
## Documentation
Example notebooks and API documentation are now hosted on [https://scikit-mol.readthedocs.io](https://scikit-mol.readthedocs.io/en/latest/)
- [Basic Usage and fingerprint transformers](https://scikit-mol.readthedocs.io/en/latest/notebooks/01_basic_usage/)
- [Descriptor transformer](https://scikit-mol.readthedocs.io/en/latest/notebooks/02_descriptor_transformer/)
- [Pipelining with Scikit-Learn classes](https://scikit-mol.readthedocs.io/en/latest/notebooks/03_example_pipeline/)
- [Molecular standardization](https://scikit-mol.readthedocs.io/en/latest/notebooks/04_standardizer/)
- [Sanitizing SMILES input](https://scikit-mol.readthedocs.io/en/latest/notebooks/05_smiles_sanitization/)
- [Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer](https://scikit-mol.readthedocs.io/en/latest/notebooks/06_hyperparameter_tuning/)
- [Using parallel execution to speed up descriptor and fingerprint calculations](https://scikit-mol.readthedocs.io/en/latest/notebooks/07_parallel_transforms/)
- [Using skopt for hyperparameter tuning](https://scikit-mol.readthedocs.io/en/latest/notebooks/08_external_library_skopt/)
- [Testing different fingerprints as part of the hyperparameter optimization](https://scikit-mol.readthedocs.io/en/latest/notebooks/09_Combinatorial_Method_Usage_with_FingerPrint_Transformers/)
- [Using pandas output for easy feature importance analysis and combining pre-existing values with new computations](https://scikit-mol.readthedocs.io/en/latest/notebooks/10_pipeline_pandas_output/)
- [Working with pipelines and estimators in safe inference mode for handling prediction on batches with invalid SMILES or molecules](https://scikit-mol.readthedocs.io/en/latest/notebooks/11_safe_inference/)
- [Creating custom fingerprint transformers](https://scikit-mol.readthedocs.io/en/latest/notebooks/12_custom_fingerprint_transformer/)
- [Estimating applicability domain using feature based estimators](https://scikit-mol.readthedocs.io/en/latest/notebooks/13_applicability_domain/)
We have also published a software note on ChemRxiv: [https://doi.org/10.26434/chemrxiv-2023-fzqwd](https://doi.org/10.26434/chemrxiv-2023-fzqwd)
## Other use-examples
Scikit-Mol has been featured in blog posts and used in research; some examples are listed below:
- [Useful ML package for cheminformatics iwatobipen.wordpress.com](https://iwatobipen.wordpress.com/2023/11/12/useful-ml-package-for-cheminformatics-rdkit-cheminformatics-ml/)
- [Boosted trees Data_in_life_blog](https://jhylin.github.io/Data_in_life_blog/posts/19_ML2-3_Boosted_trees/1_adaboost_xgb.html)
- [Konnektor: A Framework for Using Graph Theory to Plan Networks for Free Energy Calculations](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.4c01710)
- [Moldrug algorithm for an automated ligand binding site exploration by 3D aware molecular enumerations](https://doi.org/10.1186/s13321-025-01022-3)
- [RandomNets Improve Neural Network Regression Performance via Implicit Ensembling](https://chemrxiv.org/engage/chemrxiv/article-details/67656cfa81d2151a02603f48)
- [WAE-DTI: Ensemble-based architecture for drug–target interaction prediction using descriptors and embeddings](https://www.sciencedirect.com/science/article/pii/S2352914824001618)
- [Data Driven Estimation of Molecular Log-Likelihood using Fingerprint Key Counting](https://chemrxiv.org/engage/chemrxiv/article-details/661402ee21291e5d1d646651)
- [AUTONOMOUS DRUG DISCOVERY](https://www.proquest.com/openview/3e830e36bc618f263905a99e787c66c6/1?pq-origsite=gscholar&cbl=18750&diss=y)
- [DrugGym: A testbed for the economics of autonomous drug discovery](https://www.biorxiv.org/content/10.1101/2024.05.28.596296v1.abstract)
## Roadmap and Contributing
_Help wanted!_ Are you a PhD student who wants a "side-quest" to procrastinate on your thesis writing? Are you interested in computational chemistry, cheminformatics, QSAR modelling, Python programming or open-source software? Do you want to learn more about machine learning with Scikit-Learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well?
With a little bit of help, this project can be improved much faster! Reach out to me (Esben) to discuss how we can proceed.
Currently, we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on, we need to go over the scikit-learn compatibility and update to some of the newer features of its estimator classes. We're also working on feature enhancements and tests, such as new fingerprints and a more versatile standardizer.
There is more information about how to contribute to the project in [CONTRIBUTING](https://scikit-mol.readthedocs.io/en/latest/contributing/)
## BUGS
There are probably still some. Please check the issues on GitHub and report new ones there.
## Contributors
Scikit-Mol has been developed as a community effort with contributions from people from many different companies, consortia, foundations and academic institutions.
[Cheminformania Consulting](https://www.cheminformania.com), [Aptuit](https://www.linkedin.com/company/aptuit/), [BASF](https://www.basf.com), [Bayer AG](https://www.bayer.com), [Boehringer Ingelheim](https://www.boehringer-ingelheim.com/), [Chodera Lab (MSKCC)](https://www.choderalab.org/), [EPAM Systems](https://www.epam.com/), [ETH Zürich](https://ethz.ch/en.html), [Evotec](https://www.evotec.com/), [Johannes Gutenberg University](https://www.uni-mainz.de/en/), [Martin Luther University](https://www.uni-halle.de/?lang=en), [Odyssey Therapeutics](https://odysseytx.com/), [Open Molecular Software Foundation](https://omsf.io/), [Openfree.energy](https://openfree.energy/), [Polish Academy of Sciences](https://pasific.pan.pl/polish-academy-of-sciences/), [Productivista](https://www.productivista.com), [Simulations-Plus Inc.](https://www.simulations-plus.com/), [University of Vienna](https://www.univie.ac.at/en/)
- Esben Jannik Bjerrum [@ebjerrum](https://github.com/ebjerrum), esbenbjerrum+scikit_mol@gmail.com
- Carmen Esposito [@cespos](https://github.com/cespos)
- Son Ha [@son-ha-264](https://github.com/son-ha-264)
- Oh-hyeon Choung [@Ohyeon5](https://github.com/Ohyeon5)
- Andreas Poehlmann [@ap--](https://github.com/ap--)
- Ya Chen [@anya-chen](https://github.com/anya-chen)
- Anton Siomchen [@asiomchen](https://github.com/asiomchen)
- Rafał Bachorz [@rafalbachorz](https://github.com/rafalbachorz)
- Adrien Chaton [@adrienchaton](https://github.com/adrienchaton)
- [@VincentAlexanderScholz](https://github.com/VincentAlexanderScholz)
- [@RiesBen](https://github.com/RiesBen)
- [@enricogandini](https://github.com/enricogandini)
- [@mikemhenry](https://github.com/mikemhenry)
- [@c-feldmann](https://github.com/c-feldmann)
- Mieczyslaw Torchala [@mieczyslaw](https://github.com/mieczyslaw)
- Kyle Barbary [@kbarbary](https://github.com/kbarbary)
================================================
FILE: docs/api/fingerprints.base.md
================================================
`scikit_mol.fingerprints.baseclasses`
::: scikit_mol.fingerprints.baseclasses
    options:
      filters: []
================================================
FILE: docs/api/scikit_mol.applicability.md
================================================
# `scikit-mol.applicability`
::: scikit_mol.applicability
================================================
FILE: docs/api/scikit_mol.conversions.md
================================================
# `scikit-mol.conversions`
::: scikit_mol.conversions
================================================
FILE: docs/api/scikit_mol.core.md
================================================
# `scikit-mol.core`
::: scikit_mol.core
================================================
FILE: docs/api/scikit_mol.descriptors.md
================================================
# `scikit_mol.descriptors`
::: scikit_mol.descriptors
================================================
FILE: docs/api/scikit_mol.fingerprints.md
================================================
::: scikit_mol.fingerprints
    options:
      filters: ["!Fps"]
      inherited_members:
        - transform
================================================
FILE: docs/api/scikit_mol.parallel.md
================================================
# `scikit-mol.parallel`
::: scikit_mol.parallel
================================================
FILE: docs/api/scikit_mol.plotting.md
================================================
# `scikit-mol.plotting`
::: scikit_mol.plotting
================================================
FILE: docs/api/scikit_mol.safeinference.md
================================================
# `scikit-mol.safeinference`
::: scikit_mol.safeinference
================================================
FILE: docs/api/scikit_mol.standardizer.md
================================================
# `scikit-mol.standardizer`
::: scikit_mol.standardizer
================================================
FILE: docs/assets/css/tweak-width.css
================================================
/* snippet from datamol.io */
@media only screen and (min-width: 76.25em) {
.md-main__inner {
max-width: none;
padding-left: 2em;
padding-right: 2em;
}
.md-sidebar--primary {
left: 0;
}
.md-sidebar--secondary {
right: 0;
margin-left: 0;
-webkit-transform: none;
transform: none;
}
}
================================================
FILE: docs/assets/js/readthedocs.js
================================================
// Add server-side search
document.addEventListener("DOMContentLoaded", function(event) {
// Trigger Read the Docs' search addon instead of Material MkDocs default
document.querySelector(".md-search__input").addEventListener("focus", (e) => {
const event = new CustomEvent("readthedocs-search-show");
document.dispatchEvent(event);
});
});
// Use CustomEvent to generate the version selector
document.addEventListener(
"readthedocs-addons-data-ready",
function (event) {
const config = event.detail.data();
const versioning = `
<div class="md-version">
<button class="md-version__current" aria-label="Select version">
${config.versions.current.slug}
</button>
<ul class="md-version__list">
${ config.versions.active.map(
(version) => `
<li class="md-version__item">
<a href="${ version.urls.documentation }" class="md-version__link">
${ version.slug }
</a>
</li>`).join("\n")}
</ul>
</div>`;
document.querySelector(".md-header__topic").insertAdjacentHTML("beforeend", versioning);
});
================================================
FILE: docs/contributing.md
================================================
# Contribution
Thanks for your interest in contributing to the project. Please read on in the sections that apply.
## Discord Server
We have a Discord server for chat and discussion. To get an invitation, email esbenbjerrum+scikit_mol@gmail.com.
## Installation
We use [uv](https://docs.astral.sh/uv/) to manage the virtual environment. You can install it with:
```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
```
For more information and other installation methods, see the [documentation](https://docs.astral.sh/uv/).
Clone the repository and install it in editable mode like this:
```sh
git clone git@github.com:EBjerrum/scikit-mol.git
cd scikit-mol
uv sync --dev
```
After that, you can either activate the virtual environment and run commands as usual:
```sh
source .venv/bin/activate
pytest -v --cov=scikit_mol
```
or use `uv run` to run commands in the virtual environment (it automatically checks that the environment is up to date):
```sh
uv run pytest -v --cov=scikit_mol
```
`uv.lock` contains the pinned dependencies and is used to recreate the environment. Make sure to update it when adding new dependencies (this is handled automatically when using `uv run`, or manually with `uv lock`).
## Code Quality
We use [ruff](https://github.com/astral-sh/ruff) to lint and format the code. The configuration is in the [ruff.toml](https://github.com/EBjerrum/scikit-mol/blob/main/ruff.toml) file. The CI will fail if the code is not formatted correctly. You can run the linter and formatter locally with:
```sh
ruff format scikit_mol
ruff check --fix scikit_mol
```
We also have pre-commit hooks that run the linter and formatter before you commit, and we highly recommend using them. You can install them with:
```sh
pre-commit install
```
For more information on pre-commit see [documentation](https://pre-commit.com/).
## Adding transformers
The project's transformers subclass the `BaseEstimator` and `TransformerMixin` classes from scikit-learn. The scikit-learn developer guide lists the requirements: [https://scikit-learn.org/stable/developers/develop.html](https://scikit-learn.org/stable/developers/develop.html). Most notably:
- The arguments accepted by `__init__` should all be keyword arguments with a default value.
- Every keyword argument accepted by `__init__` should correspond to an attribute on the instance.
- There should be no logic, not even input validation, and the parameters should not be changed inside the `__init__` function.
  Scikit-learn classes depend on this for e.g. `.get_params()`, `.set_params()`, cloning and representation rendering to work.
- With the new error handling, transformers need to return masked arrays or arrays with `np.nan` (for float dtype) for falsy input objects.
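The rules above can be sketched with a toy transformer. This is a minimal illustration and not code from scikit-mol: the class name `ToyLengthTransformer` and its `scale` parameter are hypothetical, and string length stands in for a real RDKit computation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ToyLengthTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each string to its scaled length."""

    def __init__(self, scale=1.0):
        # Only store the keyword argument under the same name:
        # no validation, no casting, no other logic.
        self.scale = scale

    def fit(self, X, y=None):
        # Stateless transformer, nothing to learn.
        return self

    def transform(self, X):
        # Cast at use time so that types such as np.float64 are tolerated.
        scale = float(self.scale)
        # Falsy inputs (None, empty string) yield np.nan instead of raising.
        values = [len(x) * scale if x else np.nan for x in X]
        return np.array(values, dtype=float).reshape(-1, 1)


transformer = ToyLengthTransformer(scale=np.float64(2.0))
features = transformer.fit_transform(["CCO", None, "c1ccccc1"])

# Cloning and set_params work because __init__ stores its arguments untouched.
tuned = clone(transformer).set_params(scale=3.0)
```

Because `__init__` only stores its arguments, `get_params()`, `set_params()` and `clone()` need no extra code in the class.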
### Tips
- We have observed that some external tools use "exotic" types such as `np.int64` when doing hyperparameter tuning. It is thus necessary to program defensively and cast parameters to standard types before making calls to RDKit functions. This behaviour is tested in the `test_parameter_types` test.
- `@property` getters and setters can be used if additional logic is needed when setting the attributes from the keywords while still adhering to the scikit-learn requirements.
- Some RDKit features use generator objects, which may not be picklable. If such an object is instantiated once and stored as an attribute, rather than instantiated at each function call for individual molecules, it should be removed and recreated by overriding the `__getstate__()` and `__setstate__()` methods.
  See [MHFingerprintTransformer](https://github.com/EBjerrum/scikit-mol/blob/main/scikit_mol/fingerprints/minhash.py#L11) for an example.
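The pickling tip can be sketched with the standard library alone, using the `__getstate__` and `__setstate__` hooks of Python's pickle protocol. This is an illustrative sketch and not the scikit-mol implementation: `ToyTransformer` is hypothetical, and a lambda stands in for an RDKit generator object that cannot be pickled.

```python
import pickle


class ToyTransformer:
    """Hypothetical class holding an unpicklable helper as an attribute."""

    def __init__(self, n_bits=8):
        self.n_bits = n_bits
        self._recreate_generator()

    def _recreate_generator(self):
        # A lambda cannot be pickled; it stands in for an RDKit generator object.
        self._generator = lambda text: len(text) % int(self.n_bits)

    def __getstate__(self):
        # Drop the unpicklable attribute before pickling.
        state = self.__dict__.copy()
        state.pop("_generator", None)
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then rebuild the helper from them.
        self.__dict__.update(state)
        self._recreate_generator()


original = ToyTransformer(n_bits=16)
restored = pickle.loads(pickle.dumps(original))
```

Rebuilding the helper from the stored parameters keeps the pickled state small and independent of how the helper is implemented.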
## Module organisation
Currently, we have multiple classes in the same file, if they are the same type. This may change in the future.
## Docstrings
We should ultimately consolidate on the NumPy docstring format [https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) which is also used by SciPy and other scikits.
## Typehints
Parameters and return values of methods should preferably be annotated with type hints.
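As a small sketch of what a NumPy-style docstring combined with type hints can look like, consider the helper below. The function `count_token` is hypothetical and not part of scikit-mol; it only illustrates the layout.

```python
from typing import List, Optional, Sequence


def count_token(smiles_list: Sequence[Optional[str]], token: str = "C") -> List[int]:
    """Count occurrences of a token in each SMILES string.

    Parameters
    ----------
    smiles_list : Sequence[Optional[str]]
        SMILES strings; ``None`` or empty entries are counted as zero.
    token : str, default="C"
        Substring to count.

    Returns
    -------
    List[int]
        One count per input entry, in input order.
    """
    return [smiles.count(token) if smiles else 0 for smiles in smiles_list]


counts = count_token(["CCO", None, "CC(=O)O"])
```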
## Testing
New transformer classes should be added to the pytest tests in the `tests` directory. Many of the tests are general and test aspects of the transformers that are needed for scikit-learn compliance or other features. The transformer is added to a fixture, which can then be added to the lists of transformer objects that these tests run. Specific tests may also be necessary. For example, `assert_transformer_set_params` needs a list of non-default parameters in order to test the `set_params` functionality of the object.
Scikit-learn has a `check_estimator` utility that we should strive to pass; some scikit-mol classes currently do not pass all of its tests.
## Notebooks
Another way of contributing is by providing notebooks with examples of how to use the project to build models together with scikit-learn and other tools. There are .ipynb files in `docs/notebooks` and .py files in the `scripts` subfolder: the former are useful for online rendering in the documentation, whereas the latter are useful for version control.
If you want to create a new notebook, first create the .ipynb file and then run `make sync-notebooks` to create the corresponding .py file for the commit.
If you have updated any of the existing .py or .ipynb files, run `make sync-notebooks` to update the outdated file in the pair. The .py files are used for nice diffs, and the .ipynb files are used for rendering in the documentation.
`make sync-notebooks` syncs all the notebooks with the .py files in the `scripts` folder.
`make run-notebooks` syncs, runs and saves the notebooks; it expects an IPython kernel with scikit-mol installed.
If you are working on a single notebook and only want to sync and run that one, you can adapt the commands from the Makefile:
```bash
uv run jupytext --set-formats docs//notebooks//ipynb,docs//notebooks//scripts//py:percent --sync docs/notebooks/XX_YourNotebook.ipynb
uv run ruff format "docs/notebooks/XX_YourNotebook.ipynb"
uv run jupytext --execute docs/notebooks/XX_YourNotebook.ipynb
```
## Documentation
We use [MkDocs](https://www.mkdocs.org/) to build the scikit-mol documentation, which is hosted on ReadTheDocs. If you are making changes to the documentation or just want to see a live preview of your docstrings, you can look at the rendered documentation locally.
Install documentation dependencies:
```sh
uv sync --group docs
```
Start server:
```sh
uv run mkdocs serve
```
Go to http://127.0.0.1:8000 to see the documentation
## Release
### PyPI
To release a new version on PyPI, create and push a new tag in the `v0.0.0` format; the workflow will then automatically build and upload the package to PyPI. Additionally, a release draft with autogenerated notes and signed distribution files will be added to the GitHub release page. What is left is to publish the release after checking that the notes are correct.
### Conda
When you make a release on PyPI, the conda-forge bot will automatically open a PR that updates the conda feedstock to the new version. If main package dependencies are added or their pins are changed, those changes need to be added to the PR in the feedstock's [https://github.com/conda-forge/scikit-mol-feedstock/blob/main/recipe/meta.yaml](https://github.com/conda-forge/scikit-mol-feedstock/blob/main/recipe/meta.yaml), i.e. the `run` section needs to correspond to the `dependencies = [` section in `pyproject.toml`. If there is only a pure code change, all we have to do is merge the PR, which updates the package on conda-forge. See https://conda-forge.org/docs/maintainer/updating_pkgs/ for more information.
================================================
FILE: docs/index.md
================================================
# scikit-mol

## Scikit-Learn classes for molecular vectorization using RDKit
The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.

As an example, with the needed scikit-learn and scikit-mol imports and RDKit mol objects in the `mol_list_train` and `mol_list_test` lists:

    pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
    pipe.fit(mol_list_train, y_train)
    pipe.score(mol_list_test, y_test)
    pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
    >>> array([4.93858815])

The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
The first draft of the project was created at the [RDKit UGM 2022 hackathon](https://github.com/rdkit/UGM_2022) on 14 October 2022.
## Installation
Users can install the latest tagged release with pip:
```sh
pip install scikit-mol
```
or from conda-forge
```sh
conda install -c conda-forge scikit-mol
```
The conda-forge package should be updated shortly after a new tagged release on PyPI.
To install the bleeding-edge version:
```sh
pip install git+https://github.com/EBjerrum/scikit-mol.git
```
## Documentation
Example notebooks and API documentation are now hosted on [https://scikit-mol.readthedocs.io](https://scikit-mol.readthedocs.io/en/latest/)
- [Basic Usage and fingerprint transformers](https://scikit-mol.readthedocs.io/en/latest/notebooks/01_basic_usage/)
- [Descriptor transformer](https://scikit-mol.readthedocs.io/en/latest/notebooks/02_descriptor_transformer/)
- [Pipelining with Scikit-Learn classes](https://scikit-mol.readthedocs.io/en/latest/notebooks/03_example_pipeline/)
- [Molecular standardization](https://scikit-mol.readthedocs.io/en/latest/notebooks/04_standardizer/)
- [Sanitizing SMILES input](https://scikit-mol.readthedocs.io/en/latest/notebooks/05_smiles_sanitization/)
- [Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer](https://scikit-mol.readthedocs.io/en/latest/notebooks/06_hyperparameter_tuning/)
- [Using parallel execution to speed up descriptor and fingerprint calculations](https://scikit-mol.readthedocs.io/en/latest/notebooks/07_parallel_transforms/)
- [Using skopt for hyperparameter tuning](https://scikit-mol.readthedocs.io/en/latest/notebooks/08_external_library_skopt/)
- [Testing different fingerprints as part of the hyperparameter optimization](https://scikit-mol.readthedocs.io/en/latest/notebooks/09_Combinatorial_Method_Usage_with_FingerPrint_Transformers/)
- [Using pandas output for easy feature importance analysis and combine pre-existing values with new computations](https://scikit-mol.readthedocs.io/en/latest/notebooks/10_pipeline_pandas_output/)
- [Working with pipelines and estimators in safe inference mode for handling prediction on batches with invalid smiles or molecules](https://scikit-mol.readthedocs.io/en/latest/notebooks/11_safe_inference/)
- [Creating custom fingerprint transformers](https://scikit-mol.readthedocs.io/en/latest/notebooks/12_custom_fingerprint_transformer/)
- [Estimating applicability domain using feature based estimators](https://scikit-mol.readthedocs.io/en/latest/notebooks/13_applicability_domain/)
We also put a software note on ChemRxiv. [https://doi.org/10.26434/chemrxiv-2023-fzqwd](https://doi.org/10.26434/chemrxiv-2023-fzqwd)
## Other use-examples
Scikit-Mol has been featured in blog posts and used in research. Some examples are listed below:
- [Useful ML package for cheminformatics iwatobipen.wordpress.com](https://iwatobipen.wordpress.com/2023/11/12/useful-ml-package-for-cheminformatics-rdkit-cheminformatics-ml/)
- [Boosted trees Data_in_life_blog](https://jhylin.github.io/Data_in_life_blog/posts/19_ML2-3_Boosted_trees/1_adaboost_xgb.html)
- [Konnektor: A Framework for Using Graph Theory to Plan Networks for Free Energy Calculations](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.4c01710)
- [Moldrug algorithm for an automated ligand binding site exploration by 3D aware molecular enumerations](https://doi.org/10.1186/s13321-025-01022-3)
- [RandomNets Improve Neural Network Regression Performance via Implicit Ensembling](https://chemrxiv.org/engage/chemrxiv/article-details/67656cfa81d2151a02603f48)
- [WAE-DTI: Ensemble-based architecture for drug–target interaction prediction using descriptors and embeddings](https://www.sciencedirect.com/science/article/pii/S2352914824001618)
- [Data Driven Estimation of Molecular Log-Likelihood using Fingerprint Key Counting](https://chemrxiv.org/engage/chemrxiv/article-details/661402ee21291e5d1d646651)
- [AUTONOMOUS DRUG DISCOVERY](https://www.proquest.com/openview/3e830e36bc618f263905a99e787c66c6/1?pq-origsite=gscholar&cbl=18750&diss=y)
- [DrugGym: A testbed for the economics of autonomous drug discovery](https://www.biorxiv.org/content/10.1101/2024.05.28.596296v1.abstract)
## Roadmap and Contributing
_Help wanted!_ Are you a PhD student who wants a "side-quest" to procrastinate on your thesis writing, or are you interested in computational chemistry, cheminformatics, QSAR modelling, Python programming or open-source software? Do you want to learn more about machine learning with Scikit-Learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well?
With a little bit of help, this project can be improved much faster! Reach out to me (Esben) for a discussion about how we can proceed.
Currently, we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on, we need to go over the scikit-learn compatibility and update to some of the newer features of their estimator classes. We're also working on some feature enhancements and tests, such as new fingerprints and a more versatile standardizer.
There is more information about how to contribute to the project in [CONTRIBUTING](https://scikit-mol.readthedocs.io/en/latest/contributing/).
## BUGS
There are probably still bugs. Please check the issues on GitHub and report new ones there.
## Contributors
Scikit-Mol has been developed as a community effort with contributions from people from many different companies, consortia, foundations and academic institutions.
[Cheminformania Consulting](https://www.cheminformania.com), [Aptuit](https://www.linkedin.com/company/aptuit/), [BASF](https://www.basf.com), [Bayer AG](https://www.bayer.com), [Boehringer Ingelheim](https://www.boehringer-ingelheim.com/), [Chodera Lab (MSKCC)](https://www.choderalab.org/), [EPAM Systems](https://www.epam.com/), [ETH Zürich](https://ethz.ch/en.html), [Evotec](https://www.evotec.com/), [Johannes Gutenberg University](https://www.uni-mainz.de/en/), [Martin Luther University](https://www.uni-halle.de/?lang=en), [Odyssey Therapeutics](https://odysseytx.com/), [Open Molecular Software Foundation](https://omsf.io/), [Openfree.energy](https://openfree.energy/), [Polish Academy of Sciences](https://pasific.pan.pl/polish-academy-of-sciences/), [Productivista](https://www.productivista.com), [Simulations-Plus Inc.](https://www.simulations-plus.com/), [University of Vienna](https://www.univie.ac.at/en/)
- Esben Jannik Bjerrum [@ebjerrum](https://github.com/ebjerrum), esbenbjerrum+scikit_mol@gmail.com
- Carmen Esposito [@cespos](https://github.com/cespos)
- Son Ha [@son-ha-264](https://github.com/son-ha-264)
- Oh-hyeon Choung [@Ohyeon5](https://github.com/Ohyeon5)
- Andreas Poehlmann [@ap--](https://github.com/ap--)
- Ya Chen [@anya-chen](https://github.com/anya-chen)
- Anton Siomchen [@asiomchen](https://github.com/asiomchen)
- Rafał Bachorz [@rafalbachorz](https://github.com/rafalbachorz)
- Adrien Chaton [@adrienchaton](https://github.com/adrienchaton)
- [@VincentAlexanderScholz](https://github.com/VincentAlexanderScholz)
- [@RiesBen](https://github.com/RiesBen)
- [@enricogandini](https://github.com/enricogandini)
- [@mikemhenry](https://github.com/mikemhenry)
- [@c-feldmann](https://github.com/c-feldmann)
- Mieczyslaw Torchala [@mieczyslaw](https://github.com/mieczyslaw)
- Kyle Barbary [@kbarbary](https://github.com/kbarbary)
================================================
FILE: docs/notebooks/01_basic_usage.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "aa079ac3",
"metadata": {},
"source": [
"# Scikit-Mol\n",
"## scikit-learn compatible RDKit transformers\n",
"\n",
"Scikit-mol is a collection of scikit-learn compatible transformer classes that integrate into the scikit-learn framework and thus bridge between the molecular information in form of RDKit molecules or SMILES and the machine learning framework from scikit-learn\n"
]
},
{
"cell_type": "markdown",
"id": "76d24789",
"metadata": {},
"source": [
"The transformer classes are easy to load, configure and use to process molecular information into vectorized formats using fingerprinters or collections of descriptors. For demonstration purposes, let's load a MorganFingerprintTransformer, which can convert a list of RDKit molecule objects into a NumPy array of Morgan fingerprints. First, create some molecules from SMILES strings."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2c8cad03",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:29.627872Z",
"iopub.status.busy": "2025-05-08T16:22:29.627571Z",
"iopub.status.idle": "2025-05-08T16:22:29.632065Z",
"shell.execute_reply": "2025-05-08T16:22:29.631373Z"
}
},
"outputs": [],
"source": [
"from IPython.core.display import HTML"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8d5b2333",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:29.634712Z",
"iopub.status.busy": "2025-05-08T16:22:29.634423Z",
"iopub.status.idle": "2025-05-08T16:22:29.845389Z",
"shell.execute_reply": "2025-05-08T16:22:29.844169Z"
}
},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"\n",
"smiles_strings = [\n",
" \"C12C([C@@H](OC(C=3C=CC(=CC3)F)C=4C=CC(=CC4)F)CC(N1CCCCCC5=CC=CC=C5)CC2)C(=O)OC\",\n",
" \"O(C1=NC=C2C(CN(CC2=C1)C)C3=CC=C(OC)C=C3)CCCN(CC)CC\",\n",
" \"O=S(=O)(N(CC=1C=CC2=CC=CC=C2C1)[C@@H]3CCNC3)C\",\n",
" \"C1(=C2C(CCCC2O)=NC=3C1=CC=CC3)NCC=4C=CC(=CC4)Cl\",\n",
" \"C1NC[C@@H](C1)[C@H](OC=2C=CC(=NC2C)OC)CC(C)C\",\n",
" \"FC(F)(F)C=1C(CN(C2CCNCC2)CC(CC)CC)=CC=CC1\",\n",
"]\n",
"\n",
"mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_strings]"
]
},
{
"cell_type": "markdown",
"id": "b9a588c7",
"metadata": {},
"source": [
"Next we import the Morgan fingerprint transformer"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0a625dda",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:29.850211Z",
"iopub.status.busy": "2025-05-08T16:22:29.848822Z",
"iopub.status.idle": "2025-05-08T16:22:30.986417Z",
"shell.execute_reply": "2025-05-08T16:22:30.984810Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MorganFingerprintTransformer(radius=3)\n"
]
}
],
"source": [
"from scikit_mol.fingerprints import MorganFingerprintTransformer\n",
"\n",
"transformer = MorganFingerprintTransformer(radius=3)\n",
"print(transformer)"
]
},
{
"cell_type": "markdown",
"id": "355610d1",
"metadata": {},
"source": [
"It actually renders as a cute little interactive block in the Jupyter notebook and lists the options that are not the default values. If we print it, it also gives the information on the settings.\n",
"\n",
"\n",
"\n",
"The graphical representation is probably nice when working with complex pipelines. However, the graphical representation doesn't work when previewing the notebook on GitHub and sometimes nbviewer.org, so for the rest of these scikit-mol notebook examples, we'll use the print() output."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9a801d0f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:30.991850Z",
"iopub.status.busy": "2025-05-08T16:22:30.990911Z",
"iopub.status.idle": "2025-05-08T16:22:31.011512Z",
"shell.execute_reply": "2025-05-08T16:22:31.010309Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-1 {\n",
" /* Definition of color scheme common for light and dark mode */\n",
" --sklearn-color-text: #000;\n",
" --sklearn-color-text-muted: #666;\n",
" --sklearn-color-line: gray;\n",
" /* Definition of color scheme for unfitted estimators */\n",
" --sklearn-color-unfitted-level-0: #fff5e6;\n",
" --sklearn-color-unfitted-level-1: #f6e4d2;\n",
" --sklearn-color-unfitted-level-2: #ffe0b3;\n",
" --sklearn-color-unfitted-level-3: chocolate;\n",
" /* Definition of color scheme for fitted estimators */\n",
" --sklearn-color-fitted-level-0: #f0f8ff;\n",
" --sklearn-color-fitted-level-1: #d4ebff;\n",
" --sklearn-color-fitted-level-2: #b3dbfd;\n",
" --sklearn-color-fitted-level-3: cornflowerblue;\n",
"\n",
" /* Specific color for light theme */\n",
" --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
" --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
" --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
" --sklearn-color-icon: #696969;\n",
"\n",
" @media (prefers-color-scheme: dark) {\n",
" /* Redefinition of color scheme for dark theme */\n",
" --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
" --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
" --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
" --sklearn-color-icon: #878787;\n",
" }\n",
"}\n",
"\n",
"#sk-container-id-1 {\n",
" color: var(--sklearn-color-text);\n",
"}\n",
"\n",
"#sk-container-id-1 pre {\n",
" padding: 0;\n",
"}\n",
"\n",
"#sk-container-id-1 input.sk-hidden--visually {\n",
" border: 0;\n",
" clip: rect(1px 1px 1px 1px);\n",
" clip: rect(1px, 1px, 1px, 1px);\n",
" height: 1px;\n",
" margin: -1px;\n",
" overflow: hidden;\n",
" padding: 0;\n",
" position: absolute;\n",
" width: 1px;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-dashed-wrapped {\n",
" border: 1px dashed var(--sklearn-color-line);\n",
" margin: 0 0.4em 0.5em 0.4em;\n",
" box-sizing: border-box;\n",
" padding-bottom: 0.4em;\n",
" background-color: var(--sklearn-color-background);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-container {\n",
" /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
" but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
" so we also need the `!important` here to be able to override the\n",
" default hidden behavior on the sphinx rendered scikit-learn.org.\n",
" See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
" display: inline-block !important;\n",
" position: relative;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-text-repr-fallback {\n",
" display: none;\n",
"}\n",
"\n",
"div.sk-parallel-item,\n",
"div.sk-serial,\n",
"div.sk-item {\n",
" /* draw centered vertical line to link estimators */\n",
" background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
" background-size: 2px 100%;\n",
" background-repeat: no-repeat;\n",
" background-position: center center;\n",
"}\n",
"\n",
"/* Parallel-specific style estimator block */\n",
"\n",
"#sk-container-id-1 div.sk-parallel-item::after {\n",
" content: \"\";\n",
" width: 100%;\n",
" border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
" flex-grow: 1;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-parallel {\n",
" display: flex;\n",
" align-items: stretch;\n",
" justify-content: center;\n",
" background-color: var(--sklearn-color-background);\n",
" position: relative;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-parallel-item {\n",
" display: flex;\n",
" flex-direction: column;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-parallel-item:first-child::after {\n",
" align-self: flex-end;\n",
" width: 50%;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-parallel-item:last-child::after {\n",
" align-self: flex-start;\n",
" width: 50%;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-parallel-item:only-child::after {\n",
" width: 0;\n",
"}\n",
"\n",
"/* Serial-specific style estimator block */\n",
"\n",
"#sk-container-id-1 div.sk-serial {\n",
" display: flex;\n",
" flex-direction: column;\n",
" align-items: center;\n",
" background-color: var(--sklearn-color-background);\n",
" padding-right: 1em;\n",
" padding-left: 1em;\n",
"}\n",
"\n",
"\n",
"/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
"clickable and can be expanded/collapsed.\n",
"- Pipeline and ColumnTransformer use this feature and define the default style\n",
"- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
"*/\n",
"\n",
"/* Pipeline and ColumnTransformer style (default) */\n",
"\n",
"#sk-container-id-1 div.sk-toggleable {\n",
" /* Default theme specific background. It is overwritten whether we have a\n",
" specific estimator or a Pipeline/ColumnTransformer */\n",
" background-color: var(--sklearn-color-background);\n",
"}\n",
"\n",
"/* Toggleable label */\n",
"#sk-container-id-1 label.sk-toggleable__label {\n",
" cursor: pointer;\n",
" display: flex;\n",
" width: 100%;\n",
" margin-bottom: 0;\n",
" padding: 0.5em;\n",
" box-sizing: border-box;\n",
" text-align: center;\n",
" align-items: start;\n",
" justify-content: space-between;\n",
" gap: 0.5em;\n",
"}\n",
"\n",
"#sk-container-id-1 label.sk-toggleable__label .caption {\n",
" font-size: 0.6rem;\n",
" font-weight: lighter;\n",
" color: var(--sklearn-color-text-muted);\n",
"}\n",
"\n",
"#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n",
" /* Arrow on the left of the label */\n",
" content: \"▸\";\n",
" float: left;\n",
" margin-right: 0.25em;\n",
" color: var(--sklearn-color-icon);\n",
"}\n",
"\n",
"#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n",
" color: var(--sklearn-color-text);\n",
"}\n",
"\n",
"/* Toggleable content - dropdown */\n",
"\n",
"#sk-container-id-1 div.sk-toggleable__content {\n",
" max-height: 0;\n",
" max-width: 0;\n",
" overflow: hidden;\n",
" text-align: left;\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-toggleable__content.fitted {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-toggleable__content pre {\n",
" margin: 0.2em;\n",
" border-radius: 0.25em;\n",
" color: var(--sklearn-color-text);\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-fitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
" /* Expand drop-down */\n",
" max-height: 200px;\n",
" max-width: 100%;\n",
" overflow: auto;\n",
"}\n",
"\n",
"#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
" content: \"▾\";\n",
"}\n",
"\n",
"/* Pipeline/ColumnTransformer-specific style */\n",
"\n",
"#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" color: var(--sklearn-color-text);\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"/* Estimator-specific style */\n",
"\n",
"/* Colorize estimator box */\n",
"#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n",
"#sk-container-id-1 div.sk-label label {\n",
" /* The background is the default theme color */\n",
" color: var(--sklearn-color-text-on-default-background);\n",
"}\n",
"\n",
"/* On hover, darken the color of the background */\n",
"#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n",
" color: var(--sklearn-color-text);\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"/* Label box, darken color on hover, fitted */\n",
"#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
" color: var(--sklearn-color-text);\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"/* Estimator label */\n",
"\n",
"#sk-container-id-1 div.sk-label label {\n",
" font-family: monospace;\n",
" font-weight: bold;\n",
" display: inline-block;\n",
" line-height: 1.2em;\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-label-container {\n",
" text-align: center;\n",
"}\n",
"\n",
"/* Estimator-specific */\n",
"#sk-container-id-1 div.sk-estimator {\n",
" font-family: monospace;\n",
" border: 1px dotted var(--sklearn-color-border-box);\n",
" border-radius: 0.25em;\n",
" box-sizing: border-box;\n",
" margin-bottom: 0.5em;\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-estimator.fitted {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-0);\n",
"}\n",
"\n",
"/* on hover */\n",
"#sk-container-id-1 div.sk-estimator:hover {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-1 div.sk-estimator.fitted:hover {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
"\n",
"/* Common style for \"i\" and \"?\" */\n",
"\n",
".sk-estimator-doc-link,\n",
"a:link.sk-estimator-doc-link,\n",
"a:visited.sk-estimator-doc-link {\n",
" float: right;\n",
" font-size: smaller;\n",
" line-height: 1em;\n",
" font-family: monospace;\n",
" background-color: var(--sklearn-color-background);\n",
" border-radius: 1em;\n",
" height: 1em;\n",
" width: 1em;\n",
" text-decoration: none !important;\n",
" margin-left: 0.5em;\n",
" text-align: center;\n",
" /* unfitted */\n",
" border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
" color: var(--sklearn-color-unfitted-level-1);\n",
"}\n",
"\n",
".sk-estimator-doc-link.fitted,\n",
"a:link.sk-estimator-doc-link.fitted,\n",
"a:visited.sk-estimator-doc-link.fitted {\n",
" /* fitted */\n",
" border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
" color: var(--sklearn-color-fitted-level-1);\n",
"}\n",
"\n",
"/* On hover */\n",
"div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
".sk-estimator-doc-link:hover,\n",
"div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
".sk-estimator-doc-link:hover {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-3);\n",
" color: var(--sklearn-color-background);\n",
" text-decoration: none;\n",
"}\n",
"\n",
"div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
".sk-estimator-doc-link.fitted:hover,\n",
"div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
".sk-estimator-doc-link.fitted:hover {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-3);\n",
" color: var(--sklearn-color-background);\n",
" text-decoration: none;\n",
"}\n",
"\n",
"/* Span, style for the box shown on hovering the info icon */\n",
".sk-estimator-doc-link span {\n",
" display: none;\n",
" z-index: 9999;\n",
" position: relative;\n",
" font-weight: normal;\n",
" right: .2ex;\n",
" padding: .5ex;\n",
" margin: .5ex;\n",
" width: min-content;\n",
" min-width: 20ex;\n",
" max-width: 50ex;\n",
" color: var(--sklearn-color-text);\n",
" box-shadow: 2pt 2pt 4pt #999;\n",
" /* unfitted */\n",
" background: var(--sklearn-color-unfitted-level-0);\n",
" border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
"}\n",
"\n",
".sk-estimator-doc-link.fitted span {\n",
" /* fitted */\n",
" background: var(--sklearn-color-fitted-level-0);\n",
" border: var(--sklearn-color-fitted-level-3);\n",
"}\n",
"\n",
".sk-estimator-doc-link:hover span {\n",
" display: block;\n",
"}\n",
"\n",
"/* \"?\"-specific style due to the `<a>` HTML tag */\n",
"\n",
"#sk-container-id-1 a.estimator_doc_link {\n",
" float: right;\n",
" font-size: 1rem;\n",
" line-height: 1em;\n",
" font-family: monospace;\n",
" background-color: var(--sklearn-color-background);\n",
" border-radius: 1rem;\n",
" height: 1rem;\n",
" width: 1rem;\n",
" text-decoration: none;\n",
" /* unfitted */\n",
" color: var(--sklearn-color-unfitted-level-1);\n",
" border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
"}\n",
"\n",
"#sk-container-id-1 a.estimator_doc_link.fitted {\n",
" /* fitted */\n",
" border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
" color: var(--sklearn-color-fitted-level-1);\n",
"}\n",
"\n",
"/* On hover */\n",
"#sk-container-id-1 a.estimator_doc_link:hover {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-3);\n",
" color: var(--sklearn-color-background);\n",
" text-decoration: none;\n",
"}\n",
"\n",
"#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-3);\n",
"}\n",
"</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>MorganFingerprintTransformer(radius=3)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow\"><div><div>MorganFingerprintTransformer</div></div><div><a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-mol.readthedocs.org/en/latest/api/scikit_mol.fingerprints/#scikit_mol.fingerprints.MorganFingerprintTransformer\">?<span>Documentation for MorganFingerprintTransformer</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></div></label><div class=\"sk-toggleable__content fitted\"><pre>MorganFingerprintTransformer(radius=3)</pre></div> </div></div></div></div>"
],
"text/plain": [
"MorganFingerprintTransformer(radius=3)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformer"
]
},
{
"cell_type": "markdown",
"id": "556858b4",
"metadata": {},
"source": [
"To see all the settings explicitly, including those left at their defaults, we can use the .get_params() method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "500dc6f7",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:31.015226Z",
"iopub.status.busy": "2025-05-08T16:22:31.014511Z",
"iopub.status.idle": "2025-05-08T16:22:31.022448Z",
"shell.execute_reply": "2025-05-08T16:22:31.021051Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'fpSize': 2048,\n",
" 'n_jobs': None,\n",
" 'radius': 3,\n",
" 'safe_inference_mode': False,\n",
" 'useBondTypes': True,\n",
" 'useChirality': False,\n",
" 'useCounts': False,\n",
" 'useFeatures': False}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parameters = transformer.get_params()\n",
"parameters"
]
},
{
"cell_type": "markdown",
"id": "d453fa33",
"metadata": {},
"source": [
"The corresponding .set_params() method can be used to update the settings from keyword arguments or from a dictionary (via ** unpacking). The get_params and set_params methods are also used by sklearn itself, for example by hyperparameter search objects."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3a27b07a",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:31.026293Z",
"iopub.status.busy": "2025-05-08T16:22:31.025546Z",
"iopub.status.idle": "2025-05-08T16:22:31.032975Z",
"shell.execute_reply": "2025-05-08T16:22:31.031700Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MorganFingerprintTransformer(fpSize=256)\n"
]
}
],
"source": [
"parameters[\"radius\"] = 2\n",
"parameters[\"fpSize\"] = 256\n",
"transformer.set_params(**parameters)\n",
"print(transformer)"
]
},
{
"cell_type": "markdown",
"id": "3dd372d3",
"metadata": {},
"source": [
"Transformation is easy: simply use the .transform() method. For sklearn compatibility, the scikit-mol transformers also have a .fit_transform() method, but it usually does not fit anything."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0f141920",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:31.036752Z",
"iopub.status.busy": "2025-05-08T16:22:31.036118Z",
"iopub.status.idle": "2025-05-08T16:22:31.043904Z",
"shell.execute_reply": "2025-05-08T16:22:31.042561Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"fps is a <class 'numpy.ndarray'> with shape (6, 256) and data type uint8\n"
]
}
],
"source": [
"fps = transformer.transform(mols)\n",
"print(f\"fps is a {type(fps)} with shape {fps.shape} and data type {fps.dtype}\")"
]
},
{
"cell_type": "markdown",
"id": "9cb75226",
"metadata": {},
"source": [
"For sklearn compatibility, the transform method can be given a second argument, which usually represents the targets in machine learning, but it is simply ignored most of the time."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "481e527f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:31.047228Z",
"iopub.status.busy": "2025-05-08T16:22:31.046700Z",
"iopub.status.idle": "2025-05-08T16:22:31.054569Z",
"shell.execute_reply": "2025-05-08T16:22:31.053275Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 1, 0, ..., 0, 0, 0],\n",
" [1, 0, 0, ..., 0, 0, 1],\n",
" [1, 0, 0, ..., 0, 0, 0],\n",
" [0, 0, 0, ..., 0, 0, 1],\n",
" [1, 1, 0, ..., 0, 0, 0],\n",
" [1, 1, 0, ..., 0, 0, 0]], shape=(6, 256), dtype=uint8)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = range(len(mols))\n",
"transformer.transform(mols, y)"
]
},
{
"cell_type": "markdown",
"id": "500cec09",
"metadata": {},
"source": [
"Sometimes we may want to transform SMILES into molecules, and scikit-mol also has a transformer for that. It simply takes a list of SMILES and produces the corresponding RDKit molecules. This may come in handy when building pipelines for machine learning models, as we will demonstrate in another notebook."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "7773a5a0",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:31.057901Z",
"iopub.status.busy": "2025-05-08T16:22:31.057134Z",
"iopub.status.idle": "2025-05-08T16:22:31.064119Z",
"shell.execute_reply": "2025-05-08T16:22:31.063046Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SmilesToMolTransformer()\n"
]
}
],
"source": [
"from scikit_mol.conversions import SmilesToMolTransformer\n",
"\n",
"smi2mol = SmilesToMolTransformer()\n",
"print(smi2mol)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fa484453",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:31.067178Z",
"iopub.status.busy": "2025-05-08T16:22:31.066755Z",
"iopub.status.idle": "2025-05-08T16:22:31.074756Z",
"shell.execute_reply": "2025-05-08T16:22:31.073587Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[<rdkit.Chem.rdchem.Mol object at 0x7feda5e663b0>]\n",
" [<rdkit.Chem.rdchem.Mol object at 0x7feda5e66420>]\n",
" [<rdkit.Chem.rdchem.Mol object at 0x7feda5e66490>]\n",
" [<rdkit.Chem.rdchem.Mol object at 0x7feda5e66500>]\n",
" [<rdkit.Chem.rdchem.Mol object at 0x7feda5e66570>]\n",
" [<rdkit.Chem.rdchem.Mol object at 0x7feda5e665e0>]]\n"
]
}
],
"source": [
"print(smi2mol.transform(smiles_strings))"
]
}
],
"metadata": {
"jupytext": {
"formats": "docs//notebooks//ipynb,docs//notebooks//scripts//py:percent"
},
"kernelspec": {
"display_name": "Python 3.9.4 ('rdkit')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/notebooks/02_descriptor_transformer.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "e3cf34ca",
"metadata": {},
"source": [
"# MolecularDescriptorTransformer: RDKit descriptors transformer\n",
"\n",
"The descriptors transformer can convert molecules into a list of RDKit descriptors. It largely follows the API of the other transformers, but has a few extra methods and properties to manage the descriptors."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "81745b1f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:32.631647Z",
"iopub.status.busy": "2025-05-08T16:22:32.631311Z",
"iopub.status.idle": "2025-05-08T16:22:34.194489Z",
"shell.execute_reply": "2025-05-08T16:22:34.193202Z"
},
"lines_to_next_cell": 0
},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from scikit_mol.descriptors import MolecularDescriptorTransformer"
]
},
{
"cell_type": "markdown",
"id": "2293e9e6",
"metadata": {},
"source": [
"After instantiating the descriptor transformer, we can query which descriptors it found in the RDKit framework."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "dd9a2ad0",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:34.198567Z",
"iopub.status.busy": "2025-05-08T16:22:34.197421Z",
"iopub.status.idle": "2025-05-08T16:22:34.206453Z",
"shell.execute_reply": "2025-05-08T16:22:34.205342Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 217 available descriptors\n",
"The first five descriptor names: ['MaxAbsEStateIndex', 'MaxEStateIndex', 'MinAbsEStateIndex', 'MinEStateIndex', 'qed']\n"
]
}
],
"source": [
"descriptor = MolecularDescriptorTransformer()\n",
"available_descriptors = descriptor.available_descriptors\n",
"print(f\"There are {len(available_descriptors)} available descriptors\")\n",
"print(f\"The first five descriptor names: {available_descriptors[:5]}\")"
]
},
{
"cell_type": "markdown",
"id": "110c00c0",
"metadata": {},
"source": [
"We can transform molecules into their descriptor profiles."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4431a910",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:34.210875Z",
"iopub.status.busy": "2025-05-08T16:22:34.210262Z",
"iopub.status.idle": "2025-05-08T16:22:34.353857Z",
"shell.execute_reply": "2025-05-08T16:22:34.352652Z"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAh8AAAGdCAYAAACyzRGfAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAaiNJREFUeJztvXt8G+WZ9n/NjA4+xXaOdgwJhGPCGVIILrTbpi4hCyyUvC3w0l1KWdjSwC5JKe/mtwVaegjQXaC0IbR9swG2TSm8b4GFt4RCKKFAkkI4JxASCDjg2CEBn63jPL8/Zp7RjDSSLWkkjeLr+/noY2skzzzSyPNcuu/rvh9FCCFACCGEEFIm1EoPgBBCCCHjC4oPQgghhJQVig9CCCGElBWKD0IIIYSUFYoPQgghhJQVig9CCCGElBWKD0IIIYSUFYoPQgghhJSVQKUHkI6u6+jq6sKECROgKEqlh0MIIYSQMSCEwMDAANra2qCquWMbvhMfXV1dmDFjRqWHQQghhJAC2LVrFw488MCcz/Gd+JgwYQIAY/CNjY0VHg0hhBBCxkJ/fz9mzJhhzeO58J34kKmWxsZGig9CCCGkyhiLZYKGU0IIIYSUlbzERzKZxPXXX49Zs2ahtrYWhx56KH74wx/CvjCuEAI33HADpk+fjtraWnR0dGD79u2eD5wQQggh1Ule4uOWW27BypUr8Ytf/AJvvfUWbrnlFtx66634+c9/bj3n1ltvxZ133om7774bmzZtQn19PRYsWIBIJOL54AkhhBBSfSjCHrYYhbPPPhstLS1YtWqVtW3RokWora3Fb37zGwgh0NbWhu985zu49tprAQB9fX1oaWnBPffcgwsvvHDUY/T396OpqQl9fX30fBBCCCFVQj7zd16Rj89+9rNYt24d3nnnHQDAa6+9hueeew4LFy4EAOzcuRPd3d3o6Oiw/qapqQnz5s3Dhg0b8n0dhBBCCNkPyava5V//9V/R39+P2bNnQ9M0JJNJ/PjHP8bFF18MAOju7gYAtLS0OP6upaXFeiydaDSKaDRq3e/v78/rBRBCCCGkusgr8vHAAw/gt7/9LdasWYOXX34Z9957L/793/8d9957b8EDWL58OZqamqwbG4wRQggh+zd5iY/vfve7+Nd//VdceOGFOPbYY/H3f//3WLJkCZYvXw4AaG1tBQD09PQ4/q6np8d6LJ1ly5ahr6/Puu3atauQ10EIIYSQKiEv8TE8PJzRr13TNOi6DgCYNWsWWltbsW7dOuvx/v5+bNq0Ce3t7a77DIfDVkMxNhYjhBBC9n/y8nycc845+PGPf4yZM2fi6KOPxiuvvILbbrsN3/zmNwEYXc2uueYa/OhHP8Lhhx+OWbNm4frrr0dbWxvOO++8UoyfEEIIIVVGXuLj5z//Oa6//np8+9vfxp49e9DW1oZ/+qd/wg033GA957rrrsPQ0BCuuOIK9Pb24vTTT8fatWtRU1Pj+eAJIYQQUn3k1eejHLDPByGEEFJ9lKzPx7ijtxN47nYg0mfc13Wg+w0gGa/suAghhJAqhuIjF8//DHjq+8Cra4z7Wx8C7j4deGZ5RYdFCCGEVDMUH7mIDho/+z40fvZsNX72shyYEEIIKRSKj1wIo4QYw/uMn0MfO7cTQgghJG8oPnIhRcbQXuOnFCEiWZnxEEIIIfsBFB+5sCIfpviQIkSn+CCEEEIKheIjF1bkw4x4SBHCtAshhBBSMBQfucgW+aD4IIQQQgqG4iMnZv+1+LDR6yPSa9xn2oUQQggpGIqPXNibv+7dbtvOyAchhBBSKBQfubCLjI/ftm1n5IMQQggpFIqPXNjFx563Ur8z7UIIIYQUDMVHLhyRj23u2wkhhBCSFxQfubB7Pig+CCGEEE+g+MiFXWT0daZ+Z9qFEEIIKRiKj1xki3Aw8kEIIYQUDMVHLrKKD0Y+CCGEkEKh+MgFIx+EEEKI51B8FAI9H4QQQkjBUHzkgpEPQg
ghxHMoPnJB8UEIIYR4DsVHLrKJDKZdCCGEkIKh+MgFq10IIYQQz6H4yIW9w6ljO9MuhBBCSKFQfOQiXWRoYeMn0y6EEEJIwVB85EKKj1CD8bOhxbmdEEIIIXlD8ZELKTLqpxo/G6Y5t/uVrleAJ/4NiPRVeiSEEEJIBhQfuZCej9lnAY0HAnPONu77Pe3yl/8ANvwC2La20iMhhBBCMqD4yIWMcBzWASzdAhx+hrnd5+IjHjF+JkYqOw5CCCHEBYqPnJiRD0V1/vR72kWO2/fjJIQQMh6h+MiFnLwt8aEZP/2edpHjzlYqTAghhFSQvMTHwQcfDEVRMm6LFy8GAEQiESxevBiTJ09GQ0MDFi1ahJ6enpIMvCxY4kMxfqqac7tfscSHz8dJCCFkXJKX+HjxxRexe/du6/bkk08CAL761a8CAJYsWYJHH30UDz74INavX4+uri6cf/753o+6XGREPhTndr/CyAchhBAfE8jnyVOnTnXcv/nmm3HooYfib/7mb9DX14dVq1ZhzZo1mD9/PgBg9erVmDNnDjZu3IhTTz3Vu1GXi6pNu9DzQQghxL8U7PmIxWL4zW9+g29+85tQFAWbN29GPB5HR0eH9ZzZs2dj5syZ2LBhgyeDLTsizXDKtAshhBBSNHlFPuw8/PDD6O3txTe+8Q0AQHd3N0KhEJqbmx3Pa2lpQXd3d9b9RKNRRKNR635/f3+hQ/KejMiHrHbxe+SD4oMQQoh/KTjysWrVKixcuBBtbW1FDWD58uVoamqybjNmzChqf55ieSZMrwfTLoQQQkjRFCQ+PvjgAzz11FP4x3/8R2tba2srYrEYent7Hc/t6elBa2tr1n0tW7YMfX191m3Xrl2FDKk0ZKt2gfC3mZORD0IIIT6mIPGxevVqTJs2DWeddZa1be7cuQgGg1i3bp21bdu2bejs7ER7e3vWfYXDYTQ2NjpuviFb2sX+mB+xxuZjgUQIIWTckrfnQ9d1rF69GpdccgkCgdSfNzU14bLLLsPSpUsxadIkNDY24uqrr0Z7e3t1VroAyNrhFDBSL1YkxGcw8kEIIcTH5C0+nnrqKXR2duKb3/xmxmO33347VFXFokWLEI1GsWDBAtx1112eDLQiVHvkw89jJIQQMm7JW3ycccYZEFn8DjU1NVixYgVWrFhR9MB8QVbPB/xd8cImY4QQQnwM13bJRbYmY4DPK15Y7UIIIcS/UHzkomrTLhQfhBBC/AvFRy6ydTgF/D2x0/NBCCHEx1B85CJdfKRXu/gVig9CCCE+huIjF+mTt6LA6nbq54mdhlNCCCE+huIjF+meD8C2uBwjH4QQQkghUHzkwk18KFWwsi0Np4QQQnwMxUdO0jwf9t/p+SCEEEIKguIjF1WfdqHngxBCiP+g+MhFzrSLjyd2pl0IIYT4GIqPXKS3V7f/zrQLIYQQUhAUH7mo+rQLxQchhBD/QfGRDXtaxc1w6ueJ3Rqbj1NDhBBCxi0UH9nIKj7MyIef0y5cWI4QQoiPofjIhn3itns+mHYhhBBCioLiIxuOidtuOK2itIufx0gIIWTcQvGRDUfkw63JmI8ndooPQgghPobiIxvZxEdVpV1oOCWEEOI/KD6yUs3VLjScEkII8S8UH9nImnapgmoXig9CCCE+huIjG6NWu/h4YmfahRBCiI+h+MjGaIbTqvB8+FggEUIIGbdQfGRjtA6nrHYhhBBCCoLiIxvZxEdVpV18PEZCCCHjFoqPbGTzfDDtQgghhBQFxUc23Fa0Baqj2oVruxBCCPExFB/ZsCZuxbm9qtIurHYhhBDiPyg+spE18uHztItdcPhZIBFCCBm3UHxkxZzEs4kPv6ZdHIKDkQ9CCCH+g+IjG6NGPnw6sdvFByMfhBBCfAjFRzayiQ+/LyxH8UEIIcTnUHxkwxIfaYZTv1e70PNBCCHE5+QtPj766CN8/etfx+TJk1FbW4tjjz0WL730kvW4EAI33HADpk+fjtraWnR0dGD79u
2eDrosiFE8H36d2Bn5IIQQ4nPyEh+ffvopTjvtNASDQTz++OPYunUr/uM//gMTJ060nnPrrbfizjvvxN13341Nmzahvr4eCxYsQCQS8XzwJSVb5INpF0IIIaQoAvk8+ZZbbsGMGTOwevVqa9usWbOs34UQuOOOO/C9730P5557LgDgvvvuQ0tLCx5++GFceOGFHg27DIwW+fBt2sUuPnxqiiWEEDKuySvy8d///d/4zGc+g69+9auYNm0aTjzxRPz617+2Ht+5cye6u7vR0dFhbWtqasK8efOwYcMG131Go1H09/c7br5g1GoXn0YVGPkghBDic/ISH++99x5WrlyJww8/HE888QSuvPJK/PM//zPuvfdeAEB3dzcAoKWlxfF3LS0t1mPpLF++HE1NTdZtxowZhbwO76nWDqeMfBBCCPE5eYkPXddx0kkn4Sc/+QlOPPFEXHHFFbj88stx9913FzyAZcuWoa+vz7rt2rWr4H15ymhru/hWfLDahRBCiL/JS3xMnz4dRx11lGPbnDlz0NnZCQBobW0FAPT09Die09PTYz2WTjgcRmNjo+PmC0ZLu/jV8wGKD0IIIf4mL/Fx2mmnYdu2bY5t77zzDg466CAAhvm0tbUV69atsx7v7+/Hpk2b0N7e7sFwy0kWwymrXQghhJCiyKvaZcmSJfjsZz+Ln/zkJ/ja176Gv/71r/jVr36FX/3qVwAARVFwzTXX4Ec/+hEOP/xwzJo1C9dffz3a2tpw3nnnlWL8paNq0y4UH4QQQvxNXuLj5JNPxkMPPYRly5bhpptuwqxZs3DHHXfg4osvtp5z3XXXYWhoCFdccQV6e3tx+umnY+3ataipqfF88CUla4dT875f0y4UH4QQQnxOXuIDAM4++2ycffbZWR9XFAU33XQTbrrppqIGVnFM4+ZIQkBL6AgFjAjIUAKoB/w7sVN8EEII8Tlc2yUb5sS9ZyCGh1/5CADw5NYe/OGV3Y7HfQdLbQkhhPgcio9smBO3DgUf9Y4AAN7a3Q8dfk+72AUHxQchhBD/QfGRDTOCoEPFUDQBABiKJqDLt4zVLoQQQkhBUHxkw5y4BRQMxUzxEbOLD59O7BQfhBBCfA7FRzbs4iNqRDmGokkk4fMmY+xwSgghxOdQfGTDSrsoWdIuPp3YGfkghBDicyg+spIynDrTLqbh1K8TO8UHIYQQn0PxkQ0r7aJaaZfBaki7cG0XQgghPofiIxsuaZdhpl0IIYSQoqH4yIZp3BRAKu0STUAXMu3i08iHQ3xUbhiEEEJINig+suHo8yHTLgn/p10Y+SCEEOJzKD6ykdbnQwiB4ViSaRdCCCGkSCg+smFrry4E0DcSR0IXrHYhhBBCioTiIxu2tAsA9PRHAaAK0i723yk+CCGE+A+Kj2xYaReDPQMRAGDahRBCCCkSio9spEU+9piRj1Taxa+RD4oPQggh/obiIxs2wykA7BmolrQLxQchhBB/Q/GRFdNwKqT4qMK0Cxt9EEII8SEUH9mwVbsA9rRLFYkPQfFBCCHEf1B8ZCMj7WJEPph2IYQQQoqD4iMbWTwfvu/zwYXlCCGE+ByKj2xk9PkwPR9Cpl0Y+SCEEEIKgeIjG2mej0jcmMiZdiGEEEKKg+IjG1bkQ3Fs9n3aRTDtQgghxN9QfGTD8nw43yKZhhFMuxBCCCEFQfGRjbT26pJU2sWnEztLbQkhhPgcio9spBlOJTLtInzr+XCsLEcBQgghxHdQfGTFaTiVVFXaBaD4IIQQ4jsoPrIh3MVHVaVd3O4TQgghFYbiIxtpTcYk/k+7UHwQQgjxNxQf2cgqPsy0C8UHIYQQUhB5iY/vf//7UBTFcZs9e7b1eCQSweLFizF58mQ0NDRg0aJF6Onp8XzQZSFLn49kNS0s53afEEIIqTB5Rz6OPvpo7N6927o999xz1mNLlizBo48+igcffBDr169HV1cXzj//fE8HXDayVrv4vL16OhQfhBBCfEYg7z8IBNDa2p
qxva+vD6tWrcKaNWswf/58AMDq1asxZ84cbNy4Eaeeemrxoy0npuE0Pe2iqZrxS7WkXTI6lRBCCCGVJe/Ix/bt29HW1oZDDjkEF198MTo7OwEAmzdvRjweR0dHh/Xc2bNnY+bMmdiwYUPW/UWjUfT39ztuvkBGPkRKfIQCKgIBQ68Jv0YUmHYhhBDic/ISH/PmzcM999yDtWvXYuXKldi5cyc+97nPYWBgAN3d3QiFQmhubnb8TUtLC7q7u7Puc/ny5WhqarJuM2bMKOiFeI6L4bQ+pEFRq2hhObf7hBBCSIXJK+2ycOFC6/fjjjsO8+bNw0EHHYQHHngAtbW1BQ1g2bJlWLp0qXW/v7/fHwLEZjgNBVTEEjrqwwGImOZ43HewyRghhBCfU1SpbXNzM4444gjs2LEDra2tiMVi6O3tdTynp6fH1SMiCYfDaGxsdNx8ga3JWGONodHqQwGoKqtdCCGEkGIoSnwMDg7i3XffxfTp0zF37lwEg0GsW7fOenzbtm3o7OxEe3t70QMtN8JKu6iYUBMEANSHNcD3htO0SAfFByGEEJ+RV9rl2muvxTnnnIODDjoIXV1duPHGG6FpGi666CI0NTXhsssuw9KlSzFp0iQ0Njbi6quvRnt7e/VVusAQHwqMyMcEGfkIB6Ao1ZZ28ek4CSGEjFvyEh8ffvghLrroIuzbtw9Tp07F6aefjo0bN2Lq1KkAgNtvvx2qqmLRokWIRqNYsGAB7rrrrpIMvNQIPWU4nWBLuyRUn/f5oPgghBDic/ISH/fff3/Ox2tqarBixQqsWLGiqEH5AblqrQ4FE8Iy7RJAv2q8ZYpfJ3UaTgkhhPgcru2SBaGnmoyl0i42z4dvxQc9H4QQQvwNxUcWhK29+smzJkFTFcw9aCJUzXjLFKZdCCGEkILIu736eEHY+nyce0Ib/u74NtQENTz+F1nt4tNJneKDEEKIz6H4yIaeEh+qoiAYMCMeZtpFgV8ndaZdCCGE+BumXbKg26pdNCXVYl21PB/Vknah4ZQQQoi/oPjIhikuBBTYtAeUqqt28ek4CSGEjFsoPrIgRKraRbGpDyvt4tdJneKDEEKIz6H4yILVZExxvkVybRcFwp8pjQyx4cMxEkIIGddQfGTBWtslTXzIyAcAf67vwsgHIYQQn0PxkQUZ+VCgOLZrmmZ/UjmHNDbSAx1+HCMhhJBxDcVHNkSq2sWOotmqk/1Y8cLIByGEEJ9D8ZEFmXZBhueDaRdCCCGkGCg+smBVu2QxnBoP+nBip/gghBDicyg+siGjGunio+rSLqx2IYQQ4i8oPrKQSrs4PR+OtIsfJ3ZGPgghhPgcio9sWIZT51uk2dMufvR8ZKzt4kOBRAghZFxD8ZEFWWqbHvnQNA1JYW6rirQLIx+EEEL8BcVHVmTEwPkWBTQFSbnNjxM7xQchhBCfQ/GRBau9upqedlFSqRg/pl3S0ywUH4QQQnwGxUc2rEk7LfKh2iMffhQfjHwQQgjxNxQf2TAnbSXd86Ey7UIIIYQUA8VHFlILyznFR0BVUi3XdR9O7BQfhBBCfA7FRzakdyKjw6nf0y4stSWEEOJvKD6ykcPzocvIhx+jChljovgghBDiLyg+smB1OM2odlGh+7rahWkXQggh/obiIxvWpJ3p+fB32oXigxBCiL+h+MiGrHZx6fNRVWkXP46REELIuIbiIxumUVMobp4PmXbx4cTOJmOEEEJ8DsVHFqTnQ8lYWE5BUvi4z0fGwnJ+HCMhhJDxDMVHFhSrz0fm2i6ptAs9H4QQQki+UHxkQZjpC0VN73DKahdCCCGkGIoSHzfffDMURcE111xjbYtEIli8eDEmT56MhoYGLFq0CD09PcWOs/yMqdrFhxN7hvhgnw9CCCH+omDx8eKLL+KXv/wljjvuOMf2JUuW4NFHH8WDDz6I9evXo6urC+eff37RAy07OatdqqnUluKDEEKIvyhIfAwODuLiiy/Gr3/9a0ycONHa3tfXh1
WrVuG2227D/PnzMXfuXKxevRovvPACNm7c6Nmgy4LVXl1zbHZWu1SD+PBhdIYQQsi4piDxsXjxYpx11lno6OhwbN+8eTPi8bhj++zZszFz5kxs2LDBdV/RaBT9/f2Omy+Qk7biUu1iGU59GFVgqS0hhBCfE8j3D+6//368/PLLePHFFzMe6+7uRigUQnNzs2N7S0sLuru7Xfe3fPly/OAHP8h3GKVHpl2clg8E7IbTqki7UHwQQgjxF3lFPnbt2oV/+Zd/wW9/+1vU1NR4MoBly5ahr6/Puu3atcuT/RaLYkU+nGkXjWkXQgghpCjyEh+bN2/Gnj17cNJJJyEQCCAQCGD9+vW48847EQgE0NLSglgsht7eXsff9fT0oLW11XWf4XAYjY2NjpsvsDwfmX0+/F3twrQLIYQQf5NX2uVLX/oS3njjDce2Sy+9FLNnz8b/+l//CzNmzEAwGMS6deuwaNEiAMC2bdvQ2dmJ9vZ270ZdFmTaJdPzkWCTMUIIIaRg8hIfEyZMwDHHHOPYVl9fj8mTJ1vbL7vsMixduhSTJk1CY2Mjrr76arS3t+PUU0/1btTlQEYQ1Mw+H7FqSrukt1snhBBCKkzehtPRuP3226GqKhYtWoRoNIoFCxbgrrvu8vowpSeX50P4eFVbru1CCCHE5xQtPp555hnH/ZqaGqxYsQIrVqwodtcVRbGqXdIjH6rPPR9sMkYIIcTfcG2XrGT3fLDahRBCCCkcio9sZKt2Ue2r2vpwYqf4IIQQ4nMoPrJhrWrr1uG0GpqM+VAgffoBMLSv0qMghBBSYSg+sqDAvb16QLMvLOejiV1iVemYRlm/jDE6ANx1KvCfCyo9EkIIIRWG4iMboso9H2rAeb/SDO0F4sNAb2elR0IIIaTCUHxkQZFpF9dqF2Ob7kvxISMfAef9SiPfKz1R2XEQQgipOBQfWZGej+xruwjdJ1EFO1bkw2dpF+mPEUn/CCJCCCEVgeIjC1afDxfDqRQfuh+/xfs17WJ/r7weUzIOrP5bYO0yb/cLAPEI8L87gKdKsPJydBD41ReAZ27xft/p6Drwm/8BPLy49McihJBRoPjIShbDqa3aRU/6Me2S1pnVL1EGe4rK63TVvh3AB88Dr67xdr8AsGcr8OGLwGv3e7/v3a8BXa8Ar5dg3+kM9gA7ngRe/W3pj0UIIaNA8ZEF6flQ0zwfjsiHn8WHryMfHr9vct+l8OCU0qtijbsMETTrGMKIghBCSAWh+MhCKu2S5vlQUmu7+NNw6lPxYR+H15OtbvOTeI0ow77LIQbs4/djfxpCyLiC4iMrchJ3vkWqqkBXpOHUh54PubCc6rNeJKVMu1RrdKKcFUCO99+Pn1tCyHiC4iMLMtmSXmoLAAJGNET3Y/jar5EP+4TntfiwIgilTLuU4H0sZcQm27HSfyeEkApA8ZENc9JWFS3zMVOQCF96PtL6fMAnhtNShv2lsClFGW9JIx+V8HyU6XiEEJIDio8spNqru0Q+FBn58KP4GIeRD/v+vH69cn/7lefDJ58JQsi4heIjC1a1i5oj8uFn8aGU2fMR6Qc2/QoY6HZ/XC+l4bSE3+oZ+SCEEM+h+MiCksVwCqQiH74WH+WOfLx2P/D4d4Hnf+b+eClLbUUZzKxCL0FKp4RRlWzHAuj5IIRUHIqPbMi1XVzFh6x28eFFvFJru0R6jZ8jve6Pl0MgACWOqpSqPwkjH4SQ8QXFRxYUubaL4vIWMfKRyWgTadk8HyWMqpRq3+X4HLHPByHER1B8ZEGmXdw8H/6OfFRoYbnRelaUUiA4hI3Hr7ekURW57zJ0HS2l+COEkDyh+MiCNJy6VbtYkQ8/Vg1UTHyMEvkoZYdTUQ6BgNKlXYDSRyPY54MQ4iMoPrKQiny4pV3MyIef+3yUe2G50dZXqda0Sym9EqKMJlB2OCWE+AiKjyxYng+3tItaDWmXMns+5HEqknYp4cRayv4Y5TSB0vNBCPERFB9ZUHJUu/g67Y
L0ahefpF3KUTVS8n2XMKVT8rQLq10IIf6B4iMLMu3iWu0iBYmvIx9+83yU0HNQ0jLeEqZGymkCdUSH/CiaCSHjCYqPLMi0i5Yr8uFr8VHuyMcoZaNlqRpBdZbaAqUXH0y7EEJ8BMVHFtSx9PnwY9olQ3yUy3AqxUc89+NAdZlCy5V2KXUqhIZTQoiPoPhwwzZhuxlOFaZdXI47Wp+PEqYYSlk1Usp0RVk9Hyy1JYT4B4oPN2yTmaK69fmQi7b58CJuLSw3jjwfVRv5KKMJlIZTQoiPoPhwwzZhu3U4VawOp35Ou8hx+6XPR5lKbb0WW/b90fNBCCGeQPHhhiPy4VbtYvop/Bi+llqjYobTMfT5qKYIQtn2Xc5SWx9+bgkh44q8xMfKlStx3HHHobGxEY2NjWhvb8fjjz9uPR6JRLB48WJMnjwZDQ0NWLRoEXp6ejwfdMmxRz5cS22ryXDqE/FRtaW2ZSrjpeeDEDKOyEt8HHjggbj55puxefNmvPTSS5g/fz7OPfdcbNmyBQCwZMkSPProo3jwwQexfv16dHV14fzzzy/JwEuKzXDq1l5dRkMUP4avK9bhNA/DaTV1OK3WBmYZx2K1CyHEPwTyefI555zjuP/jH/8YK1euxMaNG3HggQdi1apVWLNmDebPnw8AWL16NebMmYONGzfi1FNP9W7UpcaedtHcPB/mNj9+g7TEh+q8X2ry8XxU09ou7PNBCCGeU7DnI5lM4v7778fQ0BDa29uxefNmxONxdHR0WM+ZPXs2Zs6ciQ0bNmTdTzQaRX9/v+NWceziQ8kUH3JiZ9rFxn7bXr1KoyqVPBYhhIxC3uLjjTfeQENDA8LhML71rW/hoYcewlFHHYXu7m6EQiE0Nzc7nt/S0oLu7u6s+1u+fDmampqs24wZM/J+EZ5jm7A1t3fIFCTVkXYpd5OxbJ4Pey+OEi7+VlWeD/b5IISMT/IWH0ceeSReffVVbNq0CVdeeSUuueQSbN26teABLFu2DH19fdZt165dBe/LMxzVLi6RD82Y2BVf5s5NsVH2Ph/jwfPBDqeEEOIFeXk+ACAUCuGwww4DAMydOxcvvvgifvazn+GCCy5ALBZDb2+vI/rR09OD1tbWrPsLh8MIh8P5j7yUOAynbmmXIABGPpzHNd+L5FhKbasoguDwSnjdQ4SeD0LI+KToPh+6riMajWLu3LkIBoNYt26d9di2bdvQ2dmJ9vb2Yg9TZmziw6XUVlF9HPkYj6va7hd9PtjhlBAyfsgr8rFs2TIsXLgQM2fOxMDAANasWYNnnnkGTzzxBJqamnDZZZdh6dKlmDRpEhobG3H11Vejvb29uipdAGvC1oUCVXNpry7Fh/DZRdwe5fCz4bSkVSNer79SpnVj6PkghIwj8hIfe/bswT/8wz9g9+7daGpqwnHHHYcnnngCX/7ylwEAt99+O1RVxaJFixCNRrFgwQLcddddJRl4SZHiAwpUJVN8KKbnQ/XbRdwuNMoe+TCPk1V8lHISr9LoBKtdCCHjlLzEx6pVq3I+XlNTgxUrVmDFihVFDarimBO2gALNbWE5K/Lhs4u4Q3xUKPIBYQiN9OZsJZ3Ey7T+iuf7LqEgy3Usv31uCSHjDq7t4oYj8pH5sIx8+DvtUubIhyP14fK+lHISr9boBD0fhJBxCsWHG+YkLrKkXWS1i+q3b5C+iHzAdXITyRJO4uzzkd+xmHYhhFQYig83rMiH6io+VOn58F3kwyY0rD4f5Woyllt8jERjtsfZ4TRjf/R8EELGERQfbtjSLq6eD62KIh8ol/jI3cFUlKvJGD0fWY7FPh+EEP9A8eGGzXDqmnWxIh8+u4hXtNol9zdrYdsmsjUiK/jY1drhlKvaEkLGJxQfbtg8H26RD4Xiw+XY9sktnvmwbfLTfb62S1K3RYsck7Z372VSF/R8EELGLRQfrhiTT7Y+H6qZ0lDhs2+QPjac2tuuex35SCZsYqfISXxb9wBOuOlPWPnMu8aGEk
QMOvcN46QfPondvUO2fZfT8+Gzz221US4fFSH7MRQfbozaZMzwfGh+i3zYKbv4yD1JC9t7pSe9fd8iMZv4KHISf21XLwYiCWx4b5+xoQReiTc+6kPfSBzDkWhqY6kFQSnXqBlPvLgK+OlhQPeblR4JIVUNxYcbjmqXzIfVQBUYTiu1qm3679a21OTqddrFyzLeuJlaSSRdOrZ6NO6EeQylnIKAkQ9v2PEUMLwX2LWx0iMhpKqh+HDDMpwit+cDfhUfSoUNp27VLiU0nI7W4CwPEknh+FkKr0Tc3Lfq4bhHpZTt7ccTSbNk3OvPMCHjDIoPNyzxkaXPh4x8yFbifkEKDUUBoDi3lfzYo0yktm2ilH0+ioxGxc2IR9xtrRqPxi2jKirKKAgY+fCGpJniczFVE0LGDsWHG8JmOHWJfGim58N4ko8uQpb4UGHVCJfDGyfE6JNbCSMfXkZVZKWLVfFSgjVREnoFIh/s8+EN8jwlffR/T0gVQvHhxihru8jIh/EkH32LdIgP1bmtHMeVuHyLVxyltqWLfCSLTbuYwiBupV1K4Plwi3yUvNSWHU49wYp88D0kpBgoPtywGU41l7RLIGBbDNhX4sOcMMstPtIvxKMsLCe8fs/skY9EcfuWaZeU4dR7z4d75IN9PqoCnWkXQryA4sMN2WRMKFBcPR+h1B0/XcgrFflIFxOjrWpbQsNpsVEVy3Cqly7yYRlOy+r5YIdTT0gy7UKIF1B8uCAnsGzVLkFNgy7M7X66CFUs7ZI2cbq9J/bohMcTreLwfBR3PqTRVEZAStEfQ0ZVNNDzUXUw8kGIJ1B8uCDFR7Y+HwFNRRxmKaufvkXaS20rGvlw8XyUNO3iXSVNMplmOC1BxMBKu1TM8+Gjz2y1wVJbQjyB4sMF3W44dVEfAU1B0pfiw7+eD6WUngPba9SLnBQyDael8HyYkY+KeT58VB5ebcjPl5/+7wmpQig+XBB6amE5tz4fQVVFQr51froI2ft8KGXs85G34dTjtIvwLvJhGU5d+3x4Ve1Cz0fVwrQLIZ5A8eGCPe3iWu3i18gHXCIf5Wj0MQbDqVLCnhaOlE6xkY/0Dqcl8ErIqEqAno/qQ3qKmHYhpCgoPlwQNsOpi/ZAUFN8HvmwNxkrg/hIn8xGFR/eRmMUD82sMu2Sinx4nxpJ6pXu8+Gjz2y1wcgHIZ5A8eGCrssOp6prtUtAVZGA2evDTxdy35TaZk6k9p4WwuuJ1vYaizWzJqyF5Uq4tosuIx/2tEup13Zhnw9PYKktIZ5A8eGC0OXaLu6eDyPt4ufIR7mrXfKNfHj7nqkOz4dHaRddQIylbXxBx9ChQIeq2KJSpTaBssOpN1iRDx/93xNShVB8uKAL6flwb68eUFUkhPR8+OhCXrHIx1jER+nMlV7u2+rvATMFUwKvRCIpoCHtvJTc81HGFM/+ihC29uoUH4QUA8WHCzLyocO9w6k98iFk3b8f8E3aJfPCbE+72KtTvMCxv2L7fOjC+XspVrXVXcQHPR/+R0/CMnAz7UJIUVB8uKBbIXCXsAeMUtu46flIFrmWiKdUqs9HhuE0d5Mxzw2nHno+4jbxEU8/tx72+Sh75IOej+Kxm0xpOCWkKCg+XEhVu7i/PfbIRzLho4uQJT4q3eE08z1xtBL3+Fu+6mHaJWFPu8TTXpeHpbaZ4qPEgoCltsVjj3aw1JaQoqD4cMGqdnGrs4UhPhJmn4+kn8Kv9rQLylhqO5rnIy3SoXg5+QkB1cN+GVaVC4B4Ii2l5qHh1CHGgDI0GaPhtGiYuiLEMyg+3BCy2sX97QmqairyEfep+LAiH34QH6VJXwDIjOwU3efD3qq9NAKhMp6PMnZT3V9JMu1CiFdQfLig20pt3VDVVIdT3VdpF/vCcuVsrz6KuEibWD2NfKQfu8h9J2yej0T6ufVKfFSi2oXf2ovHLjj8FPEkpAqh+HDBaoKVJe0CwGoy5tu0i5+qXdLueys+vI1OxJM5xIdXpb
auhlN6PnyPI/JBAUdIMeQlPpYvX46TTz4ZEyZMwLRp03Deeedh27ZtjudEIhEsXrwYkydPRkNDAxYtWoSenh5PB11qxCiRDwDQFTPy4SfjWaXEx2jt1fXyRT6Uoj0fqfcro5LJowknnhTQlDEsxucl9HwUj+09FH760kFIFZKX+Fi/fj0WL16MjRs34sknn0Q8HscZZ5yBoaEh6zlLlizBo48+igcffBDr169HV1cXzj//fM8HXkqsVW2V7G9PSnz46SJUoVLb0TwfaWNQPPV8pKd4inu9jj4fJSq1Tbp6Pkp4noRw7p/iozBsPX1GItEKDoSQ6ieQz5PXrl3ruH/PPfdg2rRp2Lx5Mz7/+c+jr68Pq1atwpo1azB//nwAwOrVqzFnzhxs3LgRp556qncjLyH6KKW2AKD72fNRcfGRW4wo6ZUeXh67yKhK3GbMzEipeRb5KHPaZQwdaMkYsH0e/PWlg5DqoyjPR19fHwBg0qRJAIDNmzcjHo+jo6PDes7s2bMxc+ZMbNiwwXUf0WgU/f39jlulEWLsaRd/9fnwq+fDOfmpJfR8FJ92sXs+vI2qWPt1i3yUUhBkRIcY+SgI2zlSWO1CSFEULD50Xcc111yD0047DccccwwAoLu7G6FQCM3NzY7ntrS0oLu723U/y5cvR1NTk3WbMWNGoUPyjNGajAGArgTM5/roW6QlPmAzy5aj1DZfw6mHgsjjfdsNp8mS9vkoY6ntGNrfkzFgi3aoHi8RQMh4o2DxsXjxYrz55pu4//77ixrAsmXL0NfXZ9127dpV1P68QNg7hWbB8nywvXrmMdJD0qUstc3Yd3HnI2lLu4iS9vkoYyokI+1Shs/E/ogt2qFSwBFSFHl5PiRXXXUVHnvsMTz77LM48MADre2tra2IxWLo7e11RD96enrQ2trquq9wOIxwOFzIMEpGKvKRXXwIVQOSPnO9u6VdAEOU5BBSRTNan4+Spl28jXw40i7plUyeRT7c0i4lFASMfHiD7X/dUwFNyDgkr8iHEAJXXXUVHnroITz99NOYNWuW4/G5c+ciGAxi3bp11rZt27ahs7MT7e3t3oy4DMjIR+5qF0O3+cp45hb5AEof/Ri1w2ladCJ94i3q2N62brcbTjPMxB72+QiUM/KRfv45cRaG7Rwx7UJIceQV+Vi8eDHWrFmDRx55BBMmTLB8HE1NTaitrUVTUxMuu+wyLF26FJMmTUJjYyOuvvpqtLe3V02lCzDGyIeZdhG+7fOhpG3XSnfcPD0fpY18FLuwnM3zkRH58G5hOZWej+rDVmqrUXwQUhR5iY+VK1cCAL7whS84tq9evRrf+MY3AAC33347VFXFokWLEI1GsWDBAtx1112eDLZc5BP58KfhtNyRj1EmN3NijQsNQSUJFcK7VJDwLqUjhHC0V89oIOdZe3UdAaWM1S4Z5ch66VNx+yP2tAuE8b6qJRT1hOzH5CU+xBgWKaupqcGKFSuwYsWKggdVaYR1sc5xcVZ9HPmAUl7xkbG4m/ukHUMAQZlu0JOAVpDlKOexivF82BuMAW7iwyPPh+4S+Sin5wPw7v0fT6S/j8k4xQchBcK1XdyQfT5yRD6EGfnw1eqWvol8uBtO43at69U3/Qwza+H7TWSIj9L0x0joosyeD5dx0/eRPyVqOkfIeITiwwXdivDkqnYx0y5+jHwoKhxjH0PEqihGba8uIx/BjG1eHTsmjG+gxZhZ40nn31pmYjXoOFYxCCGQtEU+klLEltTzYe5btb3/nDjzJ/2Lhp++eBBSZVB8uKGPHvmAKiMffrqI2/qTVCLyYU3S7umKmCPy4dFkmyZsivF82M2mAJCUkY+AWQruwZhlE7OAFB9qyNx3GTwfgXDmNjJmMsrq/fTFg5Aqg+LDBTEG8WFFPvwkPipWamu+B4Ea533rcWOii4oSfPNOEzYqRMH+ifS0ixVm17wTCNJXolmRD++iKlmR45avw76NjJmMpRQY+SCkYCg+XBBj+f
YsIx9++vZTKc+H3H8gyyRt3nd4Prwak+5dSieRJlr09IiBB6kR2UdEdjhNRT5KKD7kuO3ioxydb/czMtrt+6nHDyFVBsWHGw7vRJanKD5Mu1jjTk+7lNrzkTvyIauHEtCgC8X1OcUeOyaKT+mkp12sahcPIx/yGDLykZCRj3L0+dCCsLxAfvrcVgl6PJq2ge8hIYVC8eHCWPp8QDNL7Px0AcrZZKyE6GnfrNPeEzmJJ6EiIT9ynnk+jNfmiHwUeE7SDaeWmdjhlSi2fbuMfKSJj5J6Pswxq5rNq0TPR75kpl189L9PSJVB8eGCsCaYXH0+ypCrz5cM8aE4t5eKUSIfSZv40OVHzrNqF+n5KD7tkt7nw/LzaHbxUdyEk9DTIx9lSLvIMStaqi8FJ868YdqFEO+g+HBBjKHPh2JexBU/mc7shlP7z7KJD/eqkGTCHvnwePKzNTBL35Yv8fS0i/W67F4Jb9q3a4qZirKnXUqVHpNjVgOpyAf7fORNxlo/fvrfJ6TKoPhww5yslRztpxXNz54PKT7k+Evs+bAMp1J8pEc+jIu0bo98eNXR08My3nTDKWSprYeRj5Th1PgZV+wRmxKJRKsUWjOiH4C/InZVgp4R+fDR/z4hVQbFhwtWqW2ut8dMu/hqae0M8VGpyId72iUhNCQt8eHNhVtWpCSFhoQobt8ZkQ+rz4e9RNWjyIdMu3jgVRkVq8mYPe3io89tlZDR58NPXzwIqTIoPlwQSEtfuOHHJmP2ahegjOJDTtLung894WI49Ui0SRNgEkpK2Hjk+bDC6pp3zbkSuSIfpRIEcr/0fBQF0y6EeAfFhxty8sqRdlHNtIvip2+QIs0o6xPPhxWdcKRdvHnfUpU0GpJF+kkS6dUuVn8M79rCJ6wOp3K9mzJEPuj58ITMDqcUH4QUCsWHG/pYSm1lO28ffYOsmOF0tMiHjE54bzi1V9IkixQ28fRql6SbV6LYahfjXMi1XRxelVIJAlfPh48+t1WCSKZ5PvgeElIwFB8uCGuyzlHtIiMfvhIf2TwfpTacphkz074RyqqRJDTowltB5Ix8FCc+0iMfKa9EwLP+GOlru8RLsd5NOo7XIcUHO5zmCz0fhHgHxYcb6d4JFxTVh2mXdK+KUu4+H1lKbZOptIvnhlO3yEfB7dXTPR/eeyWkr0S1DKcaUl1HSxz5UFR6PoohTXxkiBFCyJih+HBBWJNOrshH8auoek7Fql1yp12ELe1SbHQiHauSBqoHno+0tIuMajm8EsW9l7KLasrzoZXevCzHTM9HUaSLjWQ8luWZhJDRoPjIRQ7xoQXMVVSZdrFFPrK0V7dFEIqNTqQjfRm6B8Ima58P1bZQX7GeD1PgqIq5qq1QUtEIej78TVp1S0bHU0LImKH4cENPm8RdUKw+H8VfxONDnxa9DwAu6aJypV1GMZzajJtSIAiPGjTpVuSjeGGT3uejFJ6PjFJbUYb1VmyvQ5hC50ePvQk9Pc1EcqKkfWYpPggpHIoPN8zJS+TyfFiRj+ImjFceug3Bnx6M1x//30XtB0Dl0i7pS7antQqXzboUNWClRpIeiw8dKpKiuMhHMj3yIYWlh54P6SuRhtOkUErfddS2toswj7Wjuxd7h6I5/ohkkBb5SCQYPSKkUCg+XJDVLkqOheU0jzwf8V0vAwBi76wraj8AKthkLG1hOcAxkQp50bZFPnSPxIcwhU1C2Mt4vYp8eO+VsNIuVrWLWvq0i9XnQ4Nuig8NOqJxVrzkg2J6PqLC+Cww8kFI4VB8uJHeL8MFVYoPFDdhqIkRAEDD4PtF7QeAD/p8uK+BIhyRD2NMGcuTF3po8zhC0WwNzIprMqaphnizUmoerokiDadBc2G5uChDBYqtvbp8jwLQEYnTdJoPinl+IjAifBkdTwkhY4biwwUhJ3E1l/gwvv1oRXo+1KQR+m6Jf1jUfgD4YFXbmsxtSPk7FM0W+fAoxWBvBFZs63aZEqkNpgkNR4lqsZ4P4x
ghVd5XS+L5iMSTWPrAq/h/r+92lAzLyIcKHRFGPvJCEYbYGIEhsjMWmiOEjBmKDzfG0GRMNSs7ik27aMkIAGAi+tG/b09R+6pctUvuyIduMzzqVuTDW8+H4sG+ZdqlJmjsR3GU2nrk+TAjH2HVlnYpQQXKi+9/gj+8/BF+8ecdtmqXgC3ykUQkwchHPsjIx4hg5IOQYqH4cEGx+iLkEh9m5KPItEswOWL9vnvn60XtK3NtlzJVuwg38WH3fKQMj9Jw6pnnQ7dHPoozs0rDaTiQ1gVU1bzzfJiRj6Bq/Iw7Sm29O09DUeM9GI4lHJ4P+f6rTLvkjSyrj8jIB5uMEVIwFB9u5GU4LW7CCOipioP+D98qal9ZIx8oU58PNWD7Fh93eTzlOfCq2kU2hFO11Lf6QicFGfmoDRmvQbEvyObV2i7mMWTkw0i7eB/5GDGFxUgs6Xj/kw7PRxGf3f6ucbewmiojH/R8EFI0FB8uiPRJ3AXZZCyA4iaMkIhYvyf3vFPUvjLFR5n7fDj6Ydg8H3bDoxIw/8TbahdFDViej2SB6QTZg0N6Pizx4Si1LbLDqfn3ZmYHMVEaz8dIzDjOSDxp61uTEh+qolsCJW96tgC3zQEeWezFUKsGGfkYEUbkg+3VCSkcig9XRq920YKm56PYtItImdZCfTuL2lfm2i5lrnZRVFfxYf/mLVcK9i7tEjd3bY+qFDYpJDI8HynR5J3nwzScKrJEWCmJ50MKi0g86YhMJYTN81Go+Ph4m/nz7WKHWVVY4sOMfFB8EFI4FB9u6KN7PmTaJQC9KENn2Bb5mDjyQcH7AVDBtV1c1kBxeD6kOAmUQHzIhnCBov0k0o9Rkx75KIXnQzE9H/ZqFw/7fEhhEU+KtA6z9j4fBR4vPmz8jA0XO8yqQjOrXej5IKR4KD5cGUPaJejNUug1tshHW3J3cVUglWoyZvdGuEUIknbPh2YOyaOJ1krpqLZql0IjH8b7lCk+7J6PYpuMybSL8TMmlJTI9bjUNnVMW+TD9DEVVWorRUd8vIkP00cjGPkgpFgoPtwwIx9KLvEhW4kDGW2Xx34YgRqkDKdhJY7uziJ8H5XucGqPENg9H9LAqwas9t7eGU5tZaSWn6TADqdpfT6cng9vVp612qubkY9Yifp8jMRs4iNufj4VFQlhvLaimozFx6v4cDYZg0efYULGI3mLj2effRbnnHMO2traoCgKHn74YcfjQgjccMMNmD59Ompra9HR0YHt27d7Nd6yoCBtEnch4Ih8FHYRikajCJirm36sTAIA7H1/S0H7AlD5DqdZDKeOahcZQfDqwm0zs6ZSOsVGPqTnw95e3Zv3MtXhVC4sp1iCrBSeD8Am9NSA4TGBGfkotM/HeEy7CGGZy2WTsfFW7UOIl+QtPoaGhnD88cdjxYoVro/feuutuPPOO3H33Xdj06ZNqK+vx4IFCxCJRFyf70fEGNqrB7I01MqHkeEB6/fdtUcAAIZ3bytoXwAqX+2S1XCaqkiRAiFZYHQi89j2yIf0fBTX4VSmXawGcmqW11UASSvyIReWs5lZPfR82MVHIpFaWyduM5zKipi8keIjGS3dYnh+w/Y6Y4pZ7VJgxJMQAgRGf4qThQsXYuHCha6PCSFwxx134Hvf+x7OPfdcAMB9992HlpYWPPzww7jwwguLG225MMVHrrRLIFC85yMyMgjAWNl0uPkIYHgjlE/fLWhfAHJEPgrf5ZiwG041lxRCCapGrF3bGoGJIhetk5UoMu0iF3/z0vMhe4kE5aq2UKErmuGE8TDyYU+pWD4iW7VLUZEPe8QjPgyEJxQ6zOrBJjT0QI3xP8XIByEF46nnY+fOneju7kZHR4e1rampCfPmzcOGDRtc/yYajaK/v99xqzRj6XAaCKjWhbzQSSNmio+oEgLqpxr7jfYVtC8AmZEPlLnD6ShpF0VNLemuezTRypbXihKAbh670H3LPh8ZkQ8vPR9y8TolJT5SaRfvzpPd82GlXR
R75KMYz8eI++/7MzahkdTqjF/GS9SHkBLgqfjo7u4GALS0tDi2t7S0WI+ls3z5cjQ1NVm3GTNmeDmkAhm92iWoqlbZYqFLa8dGjG+QEYSh1TQCAAKJwYL2BcAHpba2Sdr+rVCmXbSU4VR4lHYRQu5bs16vKDDykVrbJVWOCsBZxeNRqW3AFvmQEZvSez40o507DPETLbTaJT6U+j02lP15+xP2cxM0FlBUmHYhpGAqXu2ybNky9PX1Wbddu3ZVeki2tEsOw6mmIG6KD6uaIE9iEePCHVNqEKhrAgCEq1J8uEzStou1rBoxIh8Bc0jeRj6cno9C13ZxNhmz1u1xpIu8ER+qJT40K2Ljrecjdc51u+dDT722giMf6WmX8UDS+IKhCwWaKT68FIuEjDc8FR+tra0AgJ6eHsf2np4e67F0wuEwGhsbHbdKI9MuuTwfQU21WlUnCu0rYYmPMAL1pvhIFvFNMmNhuTJHPhyGU9vEZlt/RVjRCa8Mp7KMV7NFVQqNfDjbq6ciH5rnfT6ksClV5MPeQMy+qnBMNyMfRVW7jN+0Sxya1d2YkQ9CCsdT8TFr1iy0trZi3bp11rb+/n5s2rQJ7e3tXh6qtIyh2kVTldQqqoWKj6ghNOJqGKG6ZgBAjSjim2TF0y7ung975ENRZZMxjyIftmXvLfFRoEBIpKddlNSaKF55PmRqR0Y+Eqbh1Nh3aapdhM3zERMy8qE7fCF5MS7TLsb/eAIagpb4YOSDkELJu9plcHAQO3bssO7v3LkTr776KiZNmoSZM2fimmuuwY9+9CMcfvjhmDVrFq6//nq0tbXhvPPO83LcJWb0yEdAVSzPRyJemOcjaV64E2oNahomAgDqhQeRj4p2OM0hPrSg1QjMK/Ehj63aUiOFt1c33qdQQIWqlMrzYexTmll1pDqzlqrJmL29uqyu1YrpcDouIx/Ge5iAhgDFByFFk7f4eOmll/DFL37Rur906VIAwCWXXIJ77rkH1113HYaGhnDFFVegt7cXp59+OtauXYuamhrvRl1irLSLnHDcnqPYIh8Fltwlo8aFO6HVoG5CMwCgXoxA6DqUHJU22alAkzEhRu1wKsWHqmmp1+XRRKs4zKxS2BTX5yOgKgioKgKung9v+nykIh+pdFGp+nwIh/iweT48KbUdH5EPkYxBARBHAKGQ0efDiroRQvImb/HxhS98IdWEywVFUXDTTTfhpptuKmpgFcXyfOR+mrWQWYHrsehR4yKe0GpQ12hEPgKKjuHhAdQ1NOW/w4w+H/IFlLDRh13YOAynqYnNEh9qAEItTiBkHj/VwMyqdikwFy/TLgFNQUBTsvT58KbDqT3yIT9Hperz4fB8JFOeD2+qXcaH4TQejyEEKT6MRSVVRj4IKZiKV7v4EUVO4jkiH0BKfBQa+RBmpYCu1aKuvhFJswxyuL+3oP1l93yUUHzYRUTWyIcp5rTUJO5V2kW1ldoWK2ykMAioqhH9sJdce9bnw6ykMsedEN57PuJJ3fKWAE7PR1R4UO0yDtMucTO1mhCaFflQGfkgpGAoPlwZ3fMBALpcS6RAw6m8cItADRRVxZBiNC8aGvi0sP1leD7K0GTMPhkrGqAGM7YrNl+GTGUpHkc+VHt0osj26gFNQVBTbZEP71qgy8Xr3D0f3kxm6aJC2Na/kVYQDaIw8aHrzvLacZJ2ScaNBSDj0BAOU3wQUiwUHy7U6sYFNRmoz/k8K/JRYJ8PYYoPPVgLABiGIT6iQ70F7a8i1S72yXi0Ph+BoOdpFyli1EDq2IVGVaQfI6ip0FTF5vnI0rm1oGOYwtbm+bAiHx6dp5F0UWGrRoqYpbYBJYlIQs+ZQnUlkbZG0ziJfCRiRuQjaTeceujRIWS8QfHhQm3SaPQVD+X2XSRNg2OhaRc1YV64A4b4GFFN8TFYYIv1SogP+2ScJe0iv+Vrqs1w6tGF217GmxIfxaVdNDUt8qGkuqcW3+fDGflIItUvxqu0S4aXw2oTry
KaTK1qm9SFIz0zJtKbio2TUltZ0ZZUAtA0I7qnMfJBSMFQfLhQpxurzQbN8tdsFNtRU5Hiw4x8RDQj0hIfLjTtUoFqF/uEmaUfhvR8qIGgzbjptecjCCguDc7yQAqDoKoaplPXyEexC8uZkQ9LfGieG06zRT6EqiFiig/52vKueEkXH+Ml8pFIiQ81YEQ+mHYhpHAoPlyo143IR82EKTmfp1vVLoVGPowQthIyIh4xU3wkRgpcXK8ikQ9j8hKKirv/shPDci5ziXyommaYTgHPFlGT6QtNK74cVvbgCGgKAqoCVZHGY9XztV2kIEtChW4tAOhN5COjeZh5juJ6Ksoiozp5+z7Sq1vGSXv1pD3yEWDkg5BiofhIJz6CEAwx0dA8OedTU5GPwsSHljS+Napm5CMRNJYm14sWH2VsMmZO9DpU3Pz429iye9ixHQBUyLSLvVmXt5EPVQumohMFTuLWcvdaep8P7zwfVrWLuZ8kVKtfTKkiHzLKEhepst6g2b0173LbdIPpOEm7yC7Gds+HtfYPISRvKD7SECNGyiMpFDQ2Ned8rvR8FBr5COiGg14LG5GPRLDBGEOkSPGRsbZLKft8mFUbphDbN2JekN08H4Gg0Y8D3lW7KI4Op7LPR2H7loZTzUy7aA7Ph1cLy8m0iz3yIT0fpTKcGvdjumJFPsKq8Vrzjnykp1nGSdpFrlydVIMIBI1qF42GU0IKhuIjjZGBTwAA/ahHc13urqyiyJ4VgaSRdtHCRrpFhIzIhxItVHxkaTJWhsiHnNT2jZjHsk3SMsSvBTRLfHiVYrAiH4GAIUBQeNvrVJ8PBQFNta1qG7CEjXd9PlLvW9LrUlsz7RJQTXOpSIkPmeIJWuIjz8/GOE+76EoAWsD4DAfAtAshhULxkcZQ3z4AhviQS6tnQ65TUqjhNKiniY+wIT7U2EBB+8vwfKAc4iNlnARgte92Rj5Mw6kW9LzPhxQ2RtrFyMUX2149qBlNxhyr2lqiyZsOp3I/SWEznHrl+TCjGc11zpLQqK4gIZxpl4woyWhkGE5HER9P3gDcfToQHczvOD5DTxhRSl0J2NIuemmjioTsx1B8pDHctxcAMKQ2QBmlv3pqCffC0i4hYVzQgmbaRalpBABo8UIv1NmqXUrf4TRhds50+xZveT4CASiatxOtldLRNCiacexC+i8IIay0izScOsSHR1U6luHUFjGSnW299nxMrDPEWEp8pKIsUnzkn3YxxUbAjAqO1l791TVA9xtA18v5HcdnyCUUhJoSHwCAAv/3CRnvUHykETPTLiPahFGfK2RfiUIjH1J81BqRD63W6CsSTBRo4qtgnw85qSXg0ufDqkixeT48GlMqpROEYnVXzV98SFEAGKW2QU11ej48KrU1jiNspbZ2w6k3gkymUibWmyWh5uuIJlMRqoBSqOfDFBv1U837OTwfug4MG5FEDO3N7zg+Q08aaRdh83wYD1B8EFIIFB9pxIYM8RELjC4+UmmXwi5ANcK4oIVqnOIjlCgw8lHBDqdxIcVHZsMszYp8aFA1KT68+ZavWdUuqUqaQvadsDXb0syF5ZyeD48iH0kdqm2hP8PzISMfXokPYz+T6pxVGVGb4TQgIx+JAj0fdWYlWK726iOfpj57UoRUKdJULpQAAsGg/YEKjYiQ6obiI42k2do8Fmwc9bnWEu4FRD50XaAGRuQjVGtUuYTqjGPW6KNEPjasADbe7TKgypXaptIucn2VlCDThIx8hGwCwePIhxYw1ncBCqoaidv+JqC6pF086POh6wK6QGq/MN6vhNeeD9Nwmhn5sImPQvt8WJEPswdOrrTLsC3aUfWRD1N8aAGEQva0C8UHIYUQqPQA/IaI9AIAkuHRl7QXRawlEkkkLfFRY6ZdwmZH1dpc4iPSBzzx/xm/H/c1oG6SbUCVazImxYf1MxG3Plzym3fAHvnwqsOpi5+k2MhHUFMRUBRoVpMx26J1RUQnpMCx94dIQrXeM689H0
21QShKSmjYIx+pPh8Fio86U3wkRgyxp7p8j7ELjuHqFh9STAsliFAggKQwPx9MuxBSEIx8pKGY4kOEm0d9rox8FGI6G4lEEVKMC3/YjHzUNBiCp07k+DZpv6Dv25E2oCziA2UwnEJDTVC1vhX2DxuVPEIImy8jFZ1Q4I0g0hyej8L9JFb/DcVY2yWk2d4zRfXE82H1EXFEPrxf20WKj7qQhtqglupmmkhFpuQY8q52iaVFPgBDgLixH0U+hEy7aEEENdXyNsn+H4SQ/KD4SEMze2yodc2jPlcUMSFFRlLRDdlkrGaCEfmoRwR6tmXhhz9J/Z4hPmS1S2U6nE6sC2Fig9GtdWDYmJCSurC+eQfspbYepRhkF1Kj2qX4yIfsjRFWbe+ZR54P2UE1I/LhUp5cDLLPR01QRW1QS63jYku7aFa1S74dTk2hUZta92hoaAALbn8W3/qvzc7nOiIf1e35EDLCoQYRDKiImyJOLjhHCMkPio80QnFjRVmtLveicgCsb8OigNBrbMRmKjXLFic0GikUVREYHnJf2Va3XdDFx9udD6Y3GTPXjMGuv+Y9vjEjZORDRVNtEJMbjWMOmJGPRFK31kgJBAPG4nJIlcgWhc2noQVCRq8PFBj5sMSHmZZQbZEPR5+PwsedkKvmpkU+4vBWJMpoRm1QQ4098mEXHwV7PkzRHJ5grcb84rZd2NYzgCe2diNmN7DaP6tDHxfyUnyDkBEOLYigplg+nTjFByEFQfGRRjgxthVtAaQmpGxRihxETPExgrAVqQjX1CFuNoEa6ndf2Xbg0x7r95Hut50PpqddTrrEGOO2PwJvPZr3GMeEVWqrYWJdCFMbDf/K3v5h6LpA3NZ6PhAIpbqQejHR2iIFAU2DKiMfBUQQ4rZF5QAglB75UIpPjVhNzMyog4ACUULPR01QQ21IsyJPI0kgKaT4MKMhhXY4DdZZ4nbzjo8AGNq3pz+Seq4t7ZIcqG7xYXk71ACCaqo8OhGj+CCkECg+0qhNyhVtcy8qB8AWis8/8hE30y4xpJzziqpiUDEu6JEBd/Ex0pe6iIu96ZGPtLVdWo9BbN5VxhD/37UYfOP/YeSdPwNe5qmtDqcqmuuCmHOAEb0ZjkTx5Fs91oJcgGE41UyB4Enkw7aPQDAE1ezzUYifJGnrbgoAIXt/OY/6fEjxEZYfG9PE6laeXAwymlEbMnw4VpQjkfrGntpWoOE0WGsIEACv7+y2Hv6oN+X/iA/ssX7XIp96tnZNRTB9XYoWhKqm3sdEges6ETLeofhIo14Y4qOucXTxIaymVgV8044YF/Go6lw/ZkSKD7PkN51Yf0p8hAc+cE5YaZGPe194H8f++QTs1FugDnaj4f/+T9SuOQ89d37R6R0pBof4CKG+1mjAFEQCd67bjpjt4qwFglBkasSLFUFt77vd81GIsJFtzzVVrn1i24e91LaI6IRMu0g/iSU+dK87nBr7rw0ahlMZ5RhJwFrELhX5KFB8hOot8REdSS0HsLsvJT4ivSnxoUA3+n5UK/LcmGnDpDScMu1CSEFQfNhJxlEHI2zc0DxllCejqGXWE1Ej8hFXwo7tI6qRtogOuns+koMp415AjwG9nakHbeJDCIH/fH4nogjhmvhivKQfgTf0g9EvatHS/ybEPWcBA90oGluH0+a6INB0IADgaLUTW7r68PSW3dZTFTUA1VyUS/Uk7WJrZBYM2jwfBXQ4NT0fQVN8hEyfig7VSIt54PmIW8cwfsr2/F57PqThtDaooSagWg3FRhKpKItasOfDFBfBWuMGWCXjANDVm0q7JAfTUi1VVG67u28E27pTokoxIx/yM5ZQ6PkgpBgoPmxEBlIT+4TmfNIuBUx2UnykRT6ipvhIDLuLD4ykVQ3YK15ig9a4XvuwDx/sG0ZtUMNvbvw2Tvz+X3HIv23G18UP0SOaoezZCvzff8x73BlYbcI1Yy2RmZ8FtBDalL04RNmNVc/aUkNKqs+HF2mXpK3BU1
ALWp6PQvadsDwfTsOpjE54sbaLPEaNFB/mv1/c88iH6fkIaagLpvJHw8lU5EO+R/l7PkzDabDeiH4AqEMUUycYItoe+QhE0qJrVVJuK4TA1365Aef8/Dl8PGAIK6uCSjPSpLqZdmGpLSGFQfFhY6jXuDgOiFpMqA2P8mwAWuGRj2TUCF+ni49YwBQfI+7iQ4sYoeuYaUy1xEfPVuDjt41v6G0n4pFXDRPgl49qwYSaIDRVQX04gJmz5+KC2PVIKEHg/b8AO/+S99gd2JqMNdeGDBPizHYAwPzgm/hkwNYDQtWgmREE1YM+H/F4mp8kUHgPEavaRRpOzWiBnKxTQrPwcctjyB4iuvleyLJNr/t81AQ01NvMKyPxlFdBRp6KiXzEVeN/pBYxfO0zRsRrt4x86Dpq4r0AgA+FGUWsksjH3sEYdn0yglhSx5Yu8/9Qd0Y+kors80HPByGFQPFhY8hc0XZAGX1FWwCpplaFiI+YcRHXNafISQSNhmN6pN/170LRXgDAFjELABDv2WY88Mp/GT+POBPJuql47HUj3XHuCW2Ov//bY6fjfTEdj2nzjQ3rb8l77A5sfT6azVVUcaix7wsnbrdERhIKoCjQAtLwWPxEK1teJ4RqVLuY50MrKPLh7PORSo2ki49iIh+m+FCcJdFeez6stEtIQ72th/FIUliltqnIR6Gejzp8GjfO94wJAp85yDAad/WZ4iPSa/UX2a4fAAAQ6WkYn7K9J5Vu2bHHiCaqUnwEjMhHUmHkg5BiGDft1Tv39OGaVWut+8cd2IQbzj4KqhQZdZMwYlaYDKkNY9qnXEW1kKZWwixZTGq1ju1JU3wgi/ioTfQCADbrh+NEdQeiPdsQTESB1+4HAMSP/zrWv70HHw9E0VwXxOcOn+r4+y8eOQ21QQ23Dp6Fv6t7Gur7fwHe/iPQeiwwYXoqmjNWbB1Om82FzHDofOCpG3Ho0CuYGPya8bqgQQOK6sWRTiKRSvkENAVaEfuWhlPZ50P6JKy1ajzs8yHLeC3Phyy19WptF1ufj1rb6RxKpFYfVgoptRXCkXb5JBbANACzGhVMbzYieDLtMtTbg3oA/aIWu4VZAWVu8zvb96R68LxjChH5BUP2qZGLSibjjHwQUgjjRnwofe/jD9ErUhveBfAz2xPCTcDsKwEAEW30FW0BQLHWKcl/0rDER8CZdtFD5oJ20dS3r8ienQg3TYESakCDboiSl8WRAB6H9sm7Rh+PkU/QH5yCk+6LIyFeAgAsPGY6QgFncKs2pOGLs6fij28k8Zj6Rfyd/ifg/osAAHsaj8G0f1mfnwCxGU4nyshHyzFA/TQoQ3tw9WEfAzttVRYB79IuCfNbZxIqQqpimVkLiapYhtP0tIuHng9pOE0XHzEr8lG8+IgndSvCUhvUIE8JAIzEbeJDRj7yKbVNRGG16g/WYk9UxWwABzQITG8yRHTvcBwjsSR2f/QhDgPQpzQhFpoE6NUkPgZsv8vIhxQfMvJhio8CV7QmZLwzbtIu0xproWth6FoYcSWMiAgighBEoMYIf0f7MP3d3wMAYoExig/r23ABE5KZOxcBZ+QDNcaxtZhxAdz74TYEVpyEd24/G4j0Wf0ZBqadBACojfQAfzBE1X0jpyFhekGmNIRxyWcPcj30eScYYfCfDJ6DnXoLIiIIXSiY1v8mdj+9Mq+Xoeup6EOTnOlUFTj0iwCAM0JvAEgJNWk49SLtkrQiHyoURUkJm4LWdpGeDxn5sFW7AJ6u7RKSgQ5zApPnzAvxYU+j1IRU1Gqp9OFQHEgKZ5M3maIZE3HbmkPBOuweMl5Ia42OxpoA6kPGvrv6RrCnx/AcRYITIeoM83asfw+qgXd6UpGPHT2DEEJAsdIuxmdcnjud1S6EFMS4iXyEW44ArjcufvFEEmfe/ize3zeM/zlvJi4PrsWsl36ExsGdxuOh0Ve0BWwTagHfhhW5GFda5EOtMY4dNNu8f/jG85ii6Dh4ZAuSA3ugAR
gUNTj8kMPx+ouzcJy6E0jGMCTCuD85Hz867xhcePIMaxJ148tHteB3l5+KT4djeAtfxlsA9v55Bf7hk59jwoZbgNMudq6Wm4NINIo6GCWczbW2pcYPnQ+8/nuEd64DAISDxkU7YEUnio98JJOpyAeQEjZqIZEPueKs6uxwmhIfxTcCk11UZYdTuc+YfCs88HzIlIuqACFNRV0g9T6PJFKeD5lGiCTyOA9SfGghDCaA7ogKBIAp4SQURUFbcy227xnE7t4IBvYanqNk7SQE6qcCg4AYrA7D6Q5b2mUgmsBHbz6LI+NvAQrQcsAhAFIRMZ2RD0IKYtyIDzvhgIZ/O+soXH7fS1izqRPPKlPxnM33qYfHJj6QZ1+JLV19+NlT2/Hcjr1You/F/ACsRk2S0GQjWtEYMb45RvcY1SxhJY5db2/CDAC9aMDJB0/E+c//AMfWfYqhWBK7E404v30Ovn6qe7TDjqIoaD/UWUr8fst38dYvHsEcdKLv3gvRNGsu0HYicOxXUwvVuTASjaEOgKpqzhTP7LONxcdkYyk1LfLhRamtzfMBwPJ8FJLSSU+7aGZ6wVvPR3qfD2OfMQ89HxFTydQENSiKYnk+ktAQievW61EKMZzaWqu/vbsfI8L4p5F9Pqab4qOrbwSBXmMZgMCEaahpbAF2u5Te+pB9g1F8MhSDogBtTbXo792H+kevRUDR8ZR2Or509OcAALpiej9oOCWkIMZN2iWdjjnTcOlpB+PwaQ2INhyIN/SDrcdEbfOY9mF16xxD2uWnT7yNs+58Dn/a2oPhWNK6YE9qdgqdyQcdBQCYnuiCnkxC+/Q967HB7c8BAPqURhzZOgEJBPDK8FS8k2jFqUfNwvfOPmpM43bj4GlN2HDEdQCApp5NwMa7gD9cjsgf/y21YJ0LI1Hj4iuNpBbhBuDUxan7SvECIR3Z50M3q0YCRVTSZBhOzchHUla7KMWnRtI7nMoKGi/7fNjNpgBQEzD2rUNFJJE0qo5MFOiIxJMQOc6vA7moXKgeW3f3YximYjdNqG1NRhSvq3cECXMtl9qJLWiYPN0YSzyPDqeDHwP/eSaw/lbjfjIO3H8x8Id/yvl5dOO1Xb04845n8cSW0ZvqyZTL3zbuxL8HVuKR0PcwMdaFXfpUPD/ne1DMz4csk2bkw4W3HgXu+izw0ebRn0vKxoefDuPcFc9j9fM7Kz0UACUUHytWrMDBBx+MmpoazJs3D3/9awlXVi0ARVFw4zlH48mlf4OHvv1ZPCVOsR5Txyg+rDD/KBPS2939uOuZdwEAf3d8Gx799qlYdITxtwdOc0Ygph88GzGhoVaJ4eOP3kP9UKqDaf0e4595SGvCjEl1VqThb49txV0Xn2StS1Iof3fe13CdsgQrEn+HNQmjXLbmxRWI/f4bwJ9/Avz116lqB5No1BBRgaBLEO2Uy4GwaaA1J9qgmXYJKHrek0g6sseCFfkwzYDFeD5ShlNnVKWYbraSuLWwnLPDaUyXKZ3iBZl9UTkAqAuYERxFQySeTL0eAAHo0EXKCDv6C0j1+Nja1W8simjbLk2n//1aF8IxI8oxaWobJk01xMeEZO/YX+Nf/gPo3AA8sxzY9y7w2u+Atx8DXr8f2Pb42PYBo2HYDx7dgre7B3DjI1sQHcVgu2PPAA7Ax7gj+n20D/4Jh6jdGBZh/Et8MebNPji1X0t8eFMevd8QjwB//C6wZwuwdlnR/+PEO2770zt4bVcvbn78beyxLwBZIUoiPn7/+99j6dKluPHGG/Hyyy/j+OOPx4IFC7Bnjz8NZwdOrEPDCV+x7gfrx7CiLVLiY7S0y78/8Q4UoWPJobtxZ8O9OPb+k1H3nln2W+OMfASDIezWjIv1x+9vwdT4R9ZjbVFDwESDzQhqKn76P47DtWccgTsvPLFo4QEYJtV/++6/4UtXrcDBl/4aP1ENI2vo7YeNfiB/vBbJn51kiJC3HgXeewYRKT7SIx8AUNtsCBAgI+0CAKJIg6
VuriYsO4XKResK8ZNI8SE9H5mGU3PSLiI1kpSeDzPtosjIh1CK3rdkxNbjAwBqzGEnoZppl9TnRM13cTlb2mWrLe0ivSCy3Pa9j4cwCUZVVm1zC6a1GgZnDTpEpHf04wx0A5tXG78LHXjmZuDZf089vv7mMU9qz+3Yi5c7jWN290fwwEsf5nz+9j2DWBp8EEHE8enE43BFbAn+Jno7XsURaD/UtuSCmTITjHw4eeW/gAFzSYVdm4D3nqnocIjBzr1DeNhsPBlN6Pjls++N8helpyTi47bbbsPll1+OSy+9FEcddRTuvvtu1NXV4T//8z9LcThPuOBvO/AujC6N4Smj+yYAQDGd71qOtMvLH3wCbdtjeCr0XfzLR98xLqrD+4DaScDJ/wgcc37G33xSMxMAMLjzJUxBr7VdLo2eqDHE0bknHICr5h+e01yaL011QcxubcRnD52C//FPN+Aq9d9wb+LLuC/xZXTqU6ENdQN/vBb4/deB+87FQe8/aIwt6CI+AKD9KuDgzwEn/j0AIGh7XrLIb41W5EOmdOT5KKjU1tle3XqvrchH8WkXa20X03Aqvz1HrchH8d+iI2lpl1qzm6qsqEnYIh8yujPmihdTZIhgLd7uHsCIXJHZ3N7WlKrcmqKa5ap1U9A2uQn9wnisb98Y1hN6/k4gEQGajf8DvPEA0PsBUDfFaOu++zXgnSdG3Y0QAj97ymjvP2OScfyVf96RM/oR+fA1fEV9HgDQ+8Wf4E/6yfgYzThhRjOaalOfXXnuBDucpkhEgeduN35vNq+h629h9MMH/OLpHdBF6v/gt5s+sJYOqBSeG05jsRg2b96MZcuWWdtUVUVHRwc2bNiQ8fxoNGqF7gGgv9+9uVapaawNofur92H91hfwuVPOGNPfaOa3+CmJ3Xh+1XVQahqhQoe6921M730ZU/R9OBoCvwyZF6iaZuCoc4Gjv2JMyFn6aUSaDgGGX0DTh0+7Pi5qxhaZKZYjWibgx99dglc6P0U8KXDJI6/gy4MPY2HoNUwNDOPA+PtoiBjfclzTLoBRNfONx6y7cv0VANhy+zlW6qEQasyKoPQeIgFFx6u3npnXvmbHkvh1MIHpXTXAmiYcsc94XSMJgX+890VMjnfjFgB6bBiv57lvyUzzGIf1makr0z/Q1RcFVCAx9CneLHDfkrqEcYyJg0FgzSQc1m/4LBIi5f2Q3BX6OSK6il133YkudfSOvhMTe3EQgDc+TiCW0CHCZteOj98B1lyIE2MJ/DporD10qNpttASpn4xwQMMepQmNGMHu/7oc749Syn7UyGaEANwevhIdtX/AsSMvAgAeqDkfjcFenNn3AD554Cp0hg/LuR9dF/inoRiuDCn4bMtkvDjyCaLDOl6/9RYEA+6CffHQe1AVgU9nnYPpcz4LVVkLXQCfP8LZrE+Kj9b3HsSrt27M/caNExqS/Tgs+hE+0abg1oYf4oe9lyHYuQFv3volJJTQ6DsgpUEAC4eiODMInDJlErYlBtA3Ekfnfb/H1MX3VGxYnouPvXv3IplMoqWlxbG9paUFb7/9dsbzly9fjh/84AdeD6Mgjjh6Lo44eu6Yn18/2QgnT0YfTtv1y8wnmNfzEYQQP2UxGucvBWoaR91vYOoRwG7giNhWQAE6lQMwU6TSL0r9GFbc9Yim2iC+cOQ0AMDs1s/jgl8G8au+c1CHCDaGr0KjYnzrbaipybUbi9q6BvSjHo0YwvEj3viABoKGb6ZhQjOGRRh1ShQnDGcK3VHRAAwCeAdoNjd1J5vx1Ft70IAIbgoHEVbihe3bfgyzQEKdYKTXPozVIxFWEVASxe3bfowogHcAOc3vEc0AgOb6WiA0CRj5BJ9TXjWem2f6d+ugsddJ02cB3QBiA8A7j6MewJellhQA1CDQZEQv9tXMxIxIN+bE3rRefy4264fjZx/MxLPK2fi/oZewD424setU1CGK08P/jUnJjzFpeAzt2uV4dgKfl/djyD4GBYiJAG
rOvBE1QQ3HHtiMNz7sRccc5/Us2dAK9AIH6R/ioOHcqZzxxm2Rc3D/jgCODMzHpYEncMwIjacVR2rtTuAUANCA9/Z0YzCaQEO4MkWvihiz1X1sdHV14YADDsALL7yA9vZ2a/t1112H9evXY9OmTY7nu0U+ZsyYgb6+PjQ2jj5RVxKh63hj/R8w/MFLCPTvgpIYgRACsbrpqJ/9BUw9+DjoABont6CxcezRirc3/QmzH/+qdf/FpjNwWN8GTIQRyn7uhJ/i9POuyPbnJeWToRj+/PYeJHQdJ2y9BUfu/A0AIDHv2wgsXD6mfezc+iI+fut5bwakqjj4lHMw7QBjrZt3X38B+7YXJmqCmoo5bY2oCagQQuCdj0fwdv0piNQYYm9i7xY0928rargBTcHs1kbU1YSBw76MDT0qOj8ZwuRPX0PjwLtF7VuiqQpmt05AvXlR2blvGG+Gjsdw3QGYe9AkHIYPgQ//in2DMbz78eAoe3Oiq0F0tXweyVAz/ubIqZi27yXgk1T+uMc0srU01gDTjgYONMT83u5OvLfhkTGlrYSiYve00xENG+/75E9fQyQ0CUP1MwAATf3vYFLvm2Mab1BTccwBjQhqKnQhsKWr3/LFZGPyYXNx6PFGSe3uvhF090Vw4kzn/+/wYB+2PvN76NFht12MW2LBRnzUOh9QVGjJCA7oXgctWdnwPjGuCXOmT0BdKAAhBNZu6cFxh87AAadd5Olx+vv70dTUNKb523PxEYvFUFdXh//zf/4PzjvvPGv7JZdcgt7eXjzyyCM5/z6fwe+v9H7cheYVc6z7G2Z+C827n8Wc+FYAwIufvwcnz/9Ktj8vH/veBX4+F4AAPvvPwBk/rPSICCGEVIh85m/PDaehUAhz587FunXrrG26rmPdunWOSAjJTvOU6eizrYIRnHYoBuoPtu7XT5xWgVG5MPlQ4PAvG7+HqmHVDkIIIX6gJMmepUuX4pJLLsFnPvMZnHLKKbjjjjswNDSESy+9tBSH2/9QFHQHZ6Apbnhkmg6Yjb2f7IIsfGma1Fq5saVz1m3AxpXASf9Q6ZEQQgipEkoiPi644AJ8/PHHuOGGG9Dd3Y0TTjgBa9euzTChkuwM1B8M9Brio2XWHAx+3AmYqfXmKT56H5tnAGf+pNKjIIQQUkWUzOZ61VVX4aqrrirV7vd7kpMOA3qBXkxAc/NUTDnkeOAFoA/1aKpvqPTwCCGEkIIZlwvLVQP1M48H3gN2h2ahGcCMw47FpqNvQM2kA3B8pQdHCCGEFAHFh085+nPn443ejzD12C9a2+Z99TsVHBEhhBDiDRQfPkXRAjj2vCWVHgYhhBDiOSVb1ZYQQgghxA2KD0IIIYSUFYoPQgghhJQVig9CCCGElBWKD0IIIYSUFYoPQgghhJQVig9CCCGElBWKD0IIIYSUFYoPQgghhJQVig9CCCGElBWKD0IIIYSUFYoPQgghhJQVig9CCCGElBXfrWorhAAA9Pf3V3gkhBBCCBkrct6W83gufCc+BgYGAAAzZsyo8EgIIYQQki8DAwNoamrK+RxFjEWilBFd19HV1YUJEyZAURRP993f348ZM2Zg165daGxs9HTfpHh4fvwPz5G/4fnxN/v7+RFCYGBgAG1tbVDV3K4O30U+VFXFgQceWNJjNDY27pcnfn+B58f/8Bz5G54ff7M/n5/RIh4SGk4JIYQQUlYoPgghhBBSVsaV+AiHw7jxxhsRDocrPRTiAs+P/+E58jc8P/6G5yeF7wynhBBCCNm/GVeRD0IIIYRUHooPQgghhJQVig9CCCGElBWKD0IIIYSUlXEjPlasWIGDDz4YNTU1mDdvHv76179Wekjjlu9///tQFMVxmz17tvV4JBLB4sWLMXnyZDQ0NGDRokXo6emp4Ij3b5599lmcc845aGtrg6IoePjhhx2PCyFwww03YPr06aitrUVHRwe2b9/ueM4nn3yCiy++GI2NjWhubsZll12GwcHBMr6K/ZfRzs
83vvGNjP+nM8880/Ecnp/SsXz5cpx88smYMGECpk2bhvPOOw/btm1zPGcs17TOzk6cddZZqKurw7Rp0/Dd734XiUSinC+lrIwL8fH73/8eS5cuxY033oiXX34Zxx9/PBYsWIA9e/ZUemjjlqOPPhq7d++2bs8995z12JIlS/Doo4/iwQcfxPr169HV1YXzzz+/gqPdvxkaGsLxxx+PFStWuD5+66234s4778Tdd9+NTZs2ob6+HgsWLEAkErGec/HFF2PLli148skn8dhjj+HZZ5/FFVdcUa6XsF8z2vkBgDPPPNPx//S73/3O8TjPT+lYv349Fi9ejI0bN+LJJ59EPB7HGWecgaGhIes5o13TkskkzjrrLMRiMbzwwgu49957cc899+CGG26oxEsqD2IccMopp4jFixdb95PJpGhraxPLly+v4KjGLzfeeKM4/vjjXR/r7e0VwWBQPPjgg9a2t956SwAQGzZsKNMIxy8AxEMPPWTd13VdtLa2ip/+9KfWtt7eXhEOh8Xvfvc7IYQQW7duFQDEiy++aD3n8ccfF4qiiI8++qhsYx8PpJ8fIYS45JJLxLnnnpv1b3h+ysuePXsEALF+/XohxNiuaX/84x+Fqqqiu7vbes7KlStFY2OjiEaj5X0BZWK/j3zEYjFs3rwZHR0d1jZVVdHR0YENGzZUcGTjm+3bt6OtrQ2HHHIILr74YnR2dgIANm/ejHg87jhfs2fPxsyZM3m+KsDOnTvR3d3tOB9NTU2YN2+edT42bNiA5uZmfOYzn7Ge09HRAVVVsWnTprKPeTzyzDPPYNq0aTjyyCNx5ZVXYt++fdZjPD/lpa+vDwAwadIkAGO7pm3YsAHHHnssWlparOcsWLAA/f392LJlSxlHXz72e/Gxd+9eJJNJx0kFgJaWFnR3d1doVOObefPm4Z577sHatWuxcuVK7Ny5E5/73OcwMDCA7u5uhEIhNDc3O/6G56syyPc81/9Pd3c3pk2b5ng8EAhg0qRJPGdl4Mwzz8R9992HdevW4ZZbbsH69euxcOFCJJNJADw/5UTXdVxzzTU47bTTcMwxxwDAmK5p3d3drv9j8rH9Ed+takv2fxYuXGj9ftxxx2HevHk46KCD8MADD6C2traCIyOk+rjwwgut34899lgcd9xxOPTQQ/HMM8/gS1/6UgVHNv5YvHgx3nzzTYeHjbiz30c+pkyZAk3TMpzFPT09aG1trdCoiJ3m5mYcccQR2LFjB1pbWxGLxdDb2+t4Ds9XZZDvea7/n9bW1gzzdiKRwCeffMJzVgEOOeQQTJkyBTt27ADA81MurrrqKjz22GP485//jAMPPNDaPpZrWmtrq+v/mHxsf2S/Fx+hUAhz587FunXrrG26rmPdunVob2+v4MiIZHBwEO+++y6mT5+OuXPnIhgMOs7Xtm3b0NnZyfNVAWbNmoXW1lbH+ejv78emTZus89He3o7e3l5s3rzZes7TTz8NXdcxb968so95vPPhhx9i3759mD59OgCen1IjhMBVV12Fhx56CE8//TRmzZrleHws17T29na88cYbDpH45JNPorGxEUcddVR5Xki5qbTjtRzcf//9IhwOi3vuuUds3bpVXHHFFaK5udnhLCbl4zvf+Y545plnxM6dO8Xzzz8vOjo6xJQpU8SePXuEEEJ861vfEjNnzhRPP/20eOmll0R7e7tob2+v8Kj3XwYGBsQrr7wiXnnlFQFA3HbbbeKVV14RH3zwgRBCiJtvvlk0NzeLRx55RLz++uvi3HPPFbNmzRIjIyPWPs4880xx4oknik2bNonnnntOHH744eKiiy6q1Evar8h1fgYGBsS1114rNmzYIHbu3CmeeuopcdJJJ4nDDz9cRCIRax88P6XjyiuvFE1NTeKZZ54Ru3fvtm7Dw8PWc0a7piUSCXHMMceIM844Q7z66qti7dq1YurUqWLZsmWVeEllYVyIDyGE+PnPfy5mzpwpQqGQOOWUU8TGjRsrPa
RxywUXXCCmT58uQqGQOOCAA8QFF1wgduzYYT0+MjIivv3tb4uJEyeKuro68ZWvfEXs3r27giPev/nzn/8sAGTcLrnkEiGEUW57/fXXi5aWFhEOh8WXvvQlsW3bNsc+9u3bJy666CLR0NAgGhsbxaWXXioGBgYq8Gr2P3Kdn+HhYXHGGWeIqVOnimAwKA466CBx+eWXZ3yx4vkpHW7nBoBYvXq19ZyxXNPef/99sXDhQlFbWyumTJkivvOd74h4PF7mV1M+FCGEKHe0hRBCCCHjl/3e80EIIYQQf0HxQQghhJCyQvFBCCGEkLJC8UEIIYSQskLxQQghhJCyQvFBCCGEkLJC8UEIIYSQskLxQQghhJCyQvFBCCGEkLJC8UEIIYSQskLxQQghhJCyQvFBCCGEkLLy/wPQWSYvnY2ogwAAAABJRU5ErkJggg==",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"smiles_list = [\"CCCC\", \"c1ccccc1\"]\n",
"mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]\n",
"\n",
"features = descriptor.transform(mols)\n",
"_ = plt.plot(np.array(features).T)"
]
},
{
"cell_type": "markdown",
"id": "fdcb0698",
"metadata": {},
"source": [
"If we only want some of them, this can be specified at object instantiation."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6caa9a54",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:34.357638Z",
"iopub.status.busy": "2025-05-08T16:22:34.356566Z",
"iopub.status.idle": "2025-05-08T16:22:34.363282Z",
"shell.execute_reply": "2025-05-08T16:22:34.362201Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Selected descriptors are ['HeavyAtomCount', 'FractionCSP3', 'RingCount', 'MolLogP', 'MolWt']\n"
]
}
],
"source": [
"some_descriptors = MolecularDescriptorTransformer(\n",
" desc_list=[\"HeavyAtomCount\", \"FractionCSP3\", \"RingCount\", \"MolLogP\", \"MolWt\"]\n",
")\n",
"print(f\"Selected descriptors are {some_descriptors.selected_descriptors}\")\n",
"features = some_descriptors.transform(mols)"
]
},
{
"cell_type": "markdown",
"id": "52eaef77",
"metadata": {},
"source": [
"If we want to update the selected descriptors on an already existing object, this can be done via the .set_params() method"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "78fc5691",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:34.366550Z",
"iopub.status.busy": "2025-05-08T16:22:34.366050Z",
"iopub.status.idle": "2025-05-08T16:22:34.372939Z",
"shell.execute_reply": "2025-05-08T16:22:34.371839Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MolecularDescriptorTransformer(desc_list=['HeavyAtomCount', 'FractionCSP3',\n",
" 'RingCount'])\n"
]
}
],
"source": [
"print(\n",
" some_descriptors.set_params(\n",
" desc_list=[\"HeavyAtomCount\", \"FractionCSP3\", \"RingCount\"]\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "310a2a0d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"formats": "docs//notebooks//ipynb,docs//notebooks//scripts//py:percent"
},
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/notebooks/03_example_pipeline.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "e7c43298",
"metadata": {},
"source": [
"# Pipelining the scikit-mol transformer\n",
"\n",
"One of the very usable things with scikit-learn are their pipelines. With pipelines different scikit-learn transformers can be stacked and operated on just as a single model object. In this example we will build a simple model that can predict directly on RDKit molecules and then expand it to one that predicts directly on SMILES strings\n",
"\n",
"First some needed imports and a dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "79139b10",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:35.876773Z",
"iopub.status.busy": "2025-05-08T16:22:35.876261Z",
"iopub.status.idle": "2025-05-08T16:22:36.754601Z",
"shell.execute_reply": "2025-05-08T16:22:36.753459Z"
}
},
"outputs": [],
"source": [
"import os\n",
"import rdkit\n",
"from rdkit import Chem\n",
"from rdkit.Chem import PandasTools\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from time import time\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "17a9cdd7",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:36.758840Z",
"iopub.status.busy": "2025-05-08T16:22:36.758015Z",
"iopub.status.idle": "2025-05-08T16:22:36.767668Z",
"shell.execute_reply": "2025-05-08T16:22:36.766504Z"
},
"lines_to_next_cell": 0
},
"outputs": [],
"source": [
"csv_file = \"../../tests/data/SLC6A4_active_excapedb_subset.csv\" # Hmm, maybe better to download directly\n",
"data = pd.read_csv(csv_file)"
]
},
{
"cell_type": "markdown",
"id": "066131b8",
"metadata": {},
"source": [
"The dataset is a subset of the SLC6A4 actives from ExcapeDB. They are hand selected to give test set performance despite the small size, and are provided as example data only and should not be used to build serious QSAR models.\n",
"\n",
"We add RDKit mol objects to the dataframe with pandastools and check that all conversions went well."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a3ec0a23",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:36.770992Z",
"iopub.status.busy": "2025-05-08T16:22:36.770360Z",
"iopub.status.idle": "2025-05-08T16:22:36.828093Z",
"shell.execute_reply": "2025-05-08T16:22:36.826677Z"
},
"lines_to_next_cell": 0
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 out of 200 SMILES failed in conversion\n"
]
}
],
"source": [
"PandasTools.AddMoleculeColumnToFrame(data, smilesCol=\"SMILES\")\n",
"print(f\"{data.ROMol.isna().sum()} out of {len(data)} SMILES failed in conversion\")"
]
},
{
"cell_type": "markdown",
"id": "eccaf4af",
"metadata": {},
"source": [
"Then, let's import some tools from scikit-learn and two transformers from scikit-mol"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4eb8f0fa",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:36.830959Z",
"iopub.status.busy": "2025-05-08T16:22:36.830663Z",
"iopub.status.idle": "2025-05-08T16:22:37.516946Z",
"shell.execute_reply": "2025-05-08T16:22:37.515550Z"
}
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import train_test_split\n",
"from scikit_mol.fingerprints import MorganFingerprintTransformer\n",
"from scikit_mol.conversions import SmilesToMolTransformer"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "99edec0f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.521222Z",
"iopub.status.busy": "2025-05-08T16:22:37.520115Z",
"iopub.status.idle": "2025-05-08T16:22:37.527537Z",
"shell.execute_reply": "2025-05-08T16:22:37.526440Z"
}
},
"outputs": [],
"source": [
"mol_list_train, mol_list_test, y_train, y_test = train_test_split(\n",
" data.ROMol, data.pXC50, random_state=0\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b8380817",
"metadata": {},
"source": [
"After a split into train and test, we'll build the first pipeline"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a27d6ff9",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.531062Z",
"iopub.status.busy": "2025-05-08T16:22:37.530349Z",
"iopub.status.idle": "2025-05-08T16:22:37.539115Z",
"shell.execute_reply": "2025-05-08T16:22:37.538026Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pipeline(steps=[('mol_transformer', MorganFingerprintTransformer()),\n",
" ('Regressor', Ridge())])\n"
]
}
],
"source": [
"pipe = Pipeline(\n",
" [(\"mol_transformer\", MorganFingerprintTransformer()), (\"Regressor\", Ridge())]\n",
")\n",
"print(pipe)"
]
},
{
"cell_type": "markdown",
"id": "6c12f9a8",
"metadata": {},
"source": [
"We can do the fit by simply providing the list of RDKit molecule objects"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "634ca919",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.542129Z",
"iopub.status.busy": "2025-05-08T16:22:37.541844Z",
"iopub.status.idle": "2025-05-08T16:22:37.609556Z",
"shell.execute_reply": "2025-05-08T16:22:37.608271Z"
},
"lines_to_next_cell": 0
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train score is :1.00\n",
"Test score is :0.55\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/anton/projects/scikit-mol/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n",
"/home/anton/projects/scikit-mol/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n",
"/home/anton/projects/scikit-mol/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n"
]
}
],
"source": [
"pipe.fit(mol_list_train, y_train)\n",
"print(f\"Train score is :{pipe.score(mol_list_train,y_train):0.2F}\")\n",
"print(f\"Test score is :{pipe.score(mol_list_test, y_test):0.2F}\")"
]
},
{
"cell_type": "markdown",
"id": "8440cc5a",
"metadata": {},
"source": [
"Nevermind the performance, or the exact value of the prediction, this is for demonstration purpures. We can easily predict on lists of molecules"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "f4431aab",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.613501Z",
"iopub.status.busy": "2025-05-08T16:22:37.613074Z",
"iopub.status.idle": "2025-05-08T16:22:37.625937Z",
"shell.execute_reply": "2025-05-08T16:22:37.624623Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([6.00400299])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pipe.predict([Chem.MolFromSmiles(\"c1ccccc1C(=O)[OH]\")])"
]
},
{
"cell_type": "markdown",
"id": "a60e242b",
"metadata": {},
"source": [
"We can also expand the already fitted pipeline, how about creating a pipeline that can predict directly from SMILES? With scikit-mol that is easy!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a908097d",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.630016Z",
"iopub.status.busy": "2025-05-08T16:22:37.629320Z",
"iopub.status.idle": "2025-05-08T16:22:37.640274Z",
"shell.execute_reply": "2025-05-08T16:22:37.639075Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pipeline(steps=[('smiles_transformer', SmilesToMolTransformer()),\n",
" ('pipe',\n",
" Pipeline(steps=[('mol_transformer',\n",
" MorganFingerprintTransformer()),\n",
" ('Regressor', Ridge())]))])\n"
]
}
],
"source": [
"smiles_pipe = Pipeline(\n",
" [(\"smiles_transformer\", SmilesToMolTransformer()), (\"pipe\", pipe)]\n",
")\n",
"print(smiles_pipe)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0124653c",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.643282Z",
"iopub.status.busy": "2025-05-08T16:22:37.642781Z",
"iopub.status.idle": "2025-05-08T16:22:37.655561Z",
"shell.execute_reply": "2025-05-08T16:22:37.652513Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([6.00400299])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smiles_pipe.predict([\"c1ccccc1C(=O)[OH]\"])"
]
},
{
"cell_type": "markdown",
"id": "069e2d01",
"metadata": {},
"source": [
"From here, the pipelines could be pickled, and later loaded for easy prediction on RDKit molecule objects or SMILES in other scripts. The transformation with the MorganTransformer will be the same as during fitting, so no need to remember if radius 2 or 3 was used for this or that model, as it is already in the pipeline itself. If we need to see the parameters for a particular pipeline of model, we can always get the non default settings via print or all settings with .get_params()."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "63c8ef60",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:37.659747Z",
"iopub.status.busy": "2025-05-08T16:22:37.658755Z",
"iopub.status.idle": "2025-05-08T16:22:37.669836Z",
"shell.execute_reply": "2025-05-08T16:22:37.668406Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'memory': None,\n",
" 'steps': [('smiles_transformer', SmilesToMolTransformer()),\n",
" ('pipe',\n",
" Pipeline(steps=[('mol_transformer', MorganFingerprintTransformer()),\n",
" ('Regressor', Ridge())]))],\n",
" 'transform_input': None,\n",
" 'verbose': False,\n",
" 'smiles_transformer': SmilesToMolTransformer(),\n",
" 'pipe': Pipeline(steps=[('mol_transformer', MorganFingerprintTransformer()),\n",
" ('Regressor', Ridge())]),\n",
" 'smiles_transformer__n_jobs': None,\n",
" 'smiles_transformer__safe_inference_mode': False,\n",
" 'pipe__memory': None,\n",
" 'pipe__steps': [('mol_transformer', MorganFingerprintTransformer()),\n",
" ('Regressor', Ridge())],\n",
" 'pipe__transform_input': None,\n",
" 'pipe__verbose': False,\n",
" 'pipe__mol_transformer': MorganFingerprintTransformer(),\n",
" 'pipe__Regressor': Ridge(),\n",
" 'pipe__mol_transformer__fpSize': 2048,\n",
" 'pipe__mol_transformer__n_jobs': None,\n",
" 'pipe__mol_transformer__radius': 2,\n",
" 'pipe__mol_transformer__safe_inference_mode': False,\n",
" 'pipe__mol_transformer__useBondTypes': True,\n",
" 'pipe__mol_transformer__useChirality': False,\n",
" 'pipe__mol_transformer__useCounts': False,\n",
" 'pipe__mol_transformer__useFeatures': False,\n",
" 'pipe__Regressor__alpha': 1.0,\n",
" 'pipe__Regressor__copy_X': True,\n",
" 'pipe__Regressor__fit_intercept': True,\n",
" 'pipe__Regressor__max_iter': None,\n",
" 'pipe__Regressor__positive': False,\n",
" 'pipe__Regressor__random_state': None,\n",
" 'pipe__Regressor__solver': 'auto',\n",
" 'pipe__Regressor__tol': 0.0001}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smiles_pipe.get_params()"
]
}
],
"metadata": {
"jupytext": {
"formats": "docs//notebooks//ipynb,docs//notebooks//scripts//py:percent"
},
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/notebooks/04_standardizer.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "095e3de9",
"metadata": {},
"source": [
"# Molecule standardization\n",
"When building machine learning models of molecules, it is important to standardize the molecules. We often don't want different predictions just because things are drawn in slightly different forms, such as protonated or deprotanted carboxylic acids. Scikit-mol provides a very basic standardize transformer based on the molvs implementation in RDKit"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d40bdabe",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:39.191239Z",
"iopub.status.busy": "2025-05-08T16:22:39.189891Z",
"iopub.status.idle": "2025-05-08T16:22:40.514182Z",
"shell.execute_reply": "2025-05-08T16:22:40.512920Z"
}
},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"from scikit_mol.standardizer import Standardizer\n",
"from scikit_mol.fingerprints import MorganFingerprintTransformer\n",
"from scikit_mol.conversions import SmilesToMolTransformer\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.linear_model import Ridge"
]
},
{
"cell_type": "markdown",
"id": "1f739296",
"metadata": {},
"source": [
"For demonstration let's create some molecules with different protonation states. The two first molecules are Benzoic acid and Sodium benzoate."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5a45dfd5",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:40.518530Z",
"iopub.status.busy": "2025-05-08T16:22:40.517654Z",
"iopub.status.idle": "2025-05-08T16:22:40.537037Z",
"shell.execute_reply": "2025-05-08T16:22:40.535847Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([<rdkit.Chem.rdchem.Mol object at 0x7f3682927ca0>], dtype=object)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"array([<rdkit.Chem.rdchem.Mol object at 0x7f3682927bc0>], dtype=object)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"smiles_strings = (\n",
" \"c1ccccc1C(=O)[OH]\",\n",
" \"c1ccccc1C(=O)[O-].[Na+]\",\n",
" \"CC[NH+](C)C\",\n",
" \"CC[N+](C)(C)C\",\n",
" \"[O-]CC(C(=O)[O-])C[NH+](C)C\",\n",
" \"[O-]CC(C(=O)[O-])C[N+](C)(C)C\",\n",
")\n",
"\n",
"smi2mol = SmilesToMolTransformer()\n",
"\n",
"mols = smi2mol.transform(smiles_strings)\n",
"for mol in mols[0:2]:\n",
" display(mol)"
]
},
{
"cell_type": "markdown",
"id": "1974e56a",
"metadata": {},
"source": [
"We can simply use the transformer directly and get a list of standardized molecules"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d13141c6",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:40.540032Z",
"iopub.status.busy": "2025-05-08T16:22:40.539703Z",
"iopub.status.idle": "2025-05-08T16:22:40.560979Z",
"shell.execute_reply": "2025-05-08T16:22:40.560007Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([['O=C(O)c1ccccc1'],\n",
" ['O=C(O)c1ccccc1'],\n",
" ['CCN(C)C'],\n",
" ['CC[N+](C)(C)C'],\n",
" ['CN(C)CC(CO)C(=O)O'],\n",
" ['C[N+](C)(C)CC(CO)C(=O)[O-]']], dtype='<U26')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can just run straight up like this. Note that neutralising is optional\n",
"standardizer = Standardizer()\n",
"standard_mols = standardizer.transform(mols)\n",
"standard_smiles = smi2mol.inverse_transform(standard_mols)\n",
"standard_smiles"
]
},
{
"cell_type": "markdown",
"id": "d268d331",
"metadata": {},
"source": [
"Some of the molecules were desalted and neutralized.\n",
"\n",
"A typical use case would be to add the standardizer to a pipeline for prediction"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a376a759",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:40.565099Z",
"iopub.status.busy": "2025-05-08T16:22:40.564746Z",
"iopub.status.idle": "2025-05-08T16:22:40.603278Z",
"shell.execute_reply": "2025-05-08T16:22:40.602109Z"
},
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predictions with no standardization: [0.51983795 0.61543701 2.31738354 3.01206795 3.44085399 4.37516731]\n",
"Predictions with standardization: [0.51983795 0.51983795 2.06562022 3.01206795 3.95446692 4.92816899]\n"
]
}
],
"source": [
"# Typical use case is to use it in an sklearn pipeline, like below\n",
"predictor = Ridge()\n",
"\n",
"std_pipe = make_pipeline(\n",
" SmilesToMolTransformer(),\n",
" Standardizer(),\n",
" MorganFingerprintTransformer(useCounts=True),\n",
" predictor,\n",
")\n",
"nonstd_pipe = make_pipeline(\n",
" SmilesToMolTransformer(), MorganFingerprintTransformer(useCounts=True), predictor\n",
")\n",
"\n",
"fake_y = range(len(smiles_strings))\n",
"\n",
"std_pipe.fit(smiles_strings, fake_y)\n",
"\n",
"\n",
"print(f\"Predictions with no standardization: {nonstd_pipe.predict(smiles_strings)}\")\n",
"print(f\"Predictions with standardization: {std_pipe.predict(smiles_strings)}\")"
]
},
{
"cell_type": "markdown",
"id": "f0d071fb",
"metadata": {},
"source": [
"As we can see, the predictions with the standardizer and without are different. The two first molecules were benzoic acid and sodium benzoate, which with the standardized pipeline is predicted as the same, but differently with the nonstandardized pipeline. Whether we want to make the prediction on the parent compound, or predict the exact form, will of course depend on the use-case, but now there is at least a way to handle it easily in pipelined predictors.\n",
"\n",
"The example also demonstrate another feature. We created the ridge regressor before creating the two pipelines. Fitting one of the pipelines thus also updated the object in the other pipeline. This can be useful for building inference pipelines that takes in SMILES molecules, but rather do the fitting on already converted and standardized molecules. However, be aware that the crossvalidation classes of scikit-learn may clone the estimators internally when doing the search loop, which would break this interdependence, and necessitate the rebuilding of the inference pipeline.\n",
"\n",
"If we had fitted the non standardizing pipeline, the model would have been different as shown below, as some of the molecules would be perceived different by the Ridge regressor."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "50f71bca",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:40.606383Z",
"iopub.status.busy": "2025-05-08T16:22:40.605699Z",
"iopub.status.idle": "2025-05-08T16:22:40.631525Z",
"shell.execute_reply": "2025-05-08T16:22:40.630106Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predictions with no standardization: [0.07445775 0.96053374 2.05993278 3.00857908 3.96365443 4.93284221]\n",
"Predictions with standardization: [0.07445775 0.07445775 2.32132164 3.00857908 2.68502208 4.30275549]\n"
]
}
],
"source": [
"nonstd_pipe.fit(smiles_strings, fake_y)\n",
"print(f\"Predictions with no standardization: {nonstd_pipe.predict(smiles_strings)}\")\n",
"print(f\"Predictions with standardization: {std_pipe.predict(smiles_strings)}\")"
]
}
],
"metadata": {
"jupytext": {
"formats": "docs//notebooks//ipynb,docs//notebooks//scripts//py:percent"
},
"kernelspec": {
"display_name": "Python 3.9.4 ('rdkit')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/notebooks/05_smiles_sanitization.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "9b787560",
"metadata": {},
"source": [
"# SMILES sanitation\n",
"Sometimes we are faced with datasets which has SMILES that rdkit doesn't want to sanitize. This can be human entry errors, or differences between RDKits more strict sanitazion and other toolkits implementations of the parser. e.g. RDKit will not handle a tetravalent nitrogen when it has no charge, where other toolkits may simply build the graph anyway, disregarding the issues with the valence rules or guessing that the nitrogen should have a charge, where it could also by accident instead have a methyl group too many."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "612aa974",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:41.856567Z",
"iopub.status.busy": "2025-05-08T16:22:41.856271Z",
"iopub.status.idle": "2025-05-08T16:22:42.443540Z",
"shell.execute_reply": "2025-05-08T16:22:42.442130Z"
},
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from rdkit.Chem import PandasTools\n",
"\n",
"csv_file = \"../../tests/data/SLC6A4_active_excapedb_subset.csv\" # Hmm, maybe better to download directly\n",
"data = pd.read_csv(csv_file)"
]
},
{
"cell_type": "markdown",
"id": "0f957a69",
"metadata": {},
"source": [
"Now, this example dataset contain all sanitizable SMILES, so for demonstration purposes, we will corrupt one of them"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b09cfd6b",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:42.448114Z",
"iopub.status.busy": "2025-05-08T16:22:42.447423Z",
"iopub.status.idle": "2025-05-08T16:22:42.454532Z",
"shell.execute_reply": "2025-05-08T16:22:42.453410Z"
}
},
"outputs": [],
"source": [
"data.loc[1, \"SMILES\"] = \"CN(C)(C)(C)\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e20fb5cc",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:42.458752Z",
"iopub.status.busy": "2025-05-08T16:22:42.457870Z",
"iopub.status.idle": "2025-05-08T16:22:42.522970Z",
"shell.execute_reply": "2025-05-08T16:22:42.521865Z"
},
"lines_to_next_cell": 2
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset contains 1 unparsable mols\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[18:22:42] Explicit valence for atom # 1 N, 4, is greater than permitted\n"
]
}
],
"source": [
"PandasTools.AddMoleculeColumnToFrame(data, smilesCol=\"SMILES\")\n",
"print(f\"Dataset contains {data.ROMol.isna().sum()} unparsable mols\")"
]
},
{
"cell_type": "markdown",
"id": "f8dccd93",
"metadata": {},
"source": [
"If we use these SMILES for the scikit-learn pipeline, we would face an error, so we need to check and clean the dataset first. The CheckSmilesSanitation can help us with that."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3dbd50b3",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:42.526969Z",
"iopub.status.busy": "2025-05-08T16:22:42.526369Z",
"iopub.status.idle": "2025-05-08T16:22:43.317088Z",
"shell.execute_reply": "2025-05-08T16:22:43.316227Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error in parsing 1 SMILES. Unparsable SMILES can be found in self.errors\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[18:22:43] Explicit valence for atom # 1 N, 4, is greater than permitted\n"
]
}
],
"source": [
"from scikit_mol.utilities import CheckSmilesSanitization\n",
"\n",
"smileschecker = CheckSmilesSanitization()\n",
"\n",
"smiles_list_valid, y_valid, smiles_errors, y_errors = smileschecker.sanitize(\n",
" list(data.SMILES), list(data.pXC50)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c888d7da",
"metadata": {},
"source": [
"Now the smiles_list_valid should be all valid and the y_values filtered as well. Errors are returned, but also accessible after the call to .sanitize() in the .errors property"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5af5ea3d",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:43.320958Z",
"iopub.status.busy": "2025-05-08T16:22:43.320157Z",
"iopub.status.idle": "2025-05-08T16:22:43.335067Z",
"shell.execute_reply": "2025-05-08T16:22:43.333676Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SMILES</th>\n",
" <th>y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CN(C)(C)(C)</td>\n",
" <td>7.18046</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SMILES y\n",
"0 CN(C)(C)(C) 7.18046"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smileschecker.errors"
]
},
{
"cell_type": "markdown",
"id": "c2ce2677",
"metadata": {},
"source": [
"The checker can also be used only on X"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "84db07cc",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:43.339302Z",
"iopub.status.busy": "2025-05-08T16:22:43.338668Z",
"iopub.status.idle": "2025-05-08T16:22:43.391019Z",
"shell.execute_reply": "2025-05-08T16:22:43.389989Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error in parsing 1 SMILES. Unparsable SMILES can be found in self.errors\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[18:22:43] Explicit valence for atom # 1 N, 4, is greater than permitted\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SMILES</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CN(C)(C)(C)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" SMILES\n",
"0 CN(C)(C)(C)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"smiles_list_valid, X_errors = smileschecker.sanitize(list(data.SMILES))\n",
"smileschecker.errors"
]
}
],
"metadata": {
"jupytext": {
"formats": "docs//notebooks//ipynb,docs//notebooks//scripts//py:percent"
},
"kernelspec": {
"display_name": "Python 3.9.4 ('rdkit')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/notebooks/06_hyperparameter_tuning.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "f0b0cc54",
"metadata": {},
"source": [
"# Full example: Hyperparameter tuning\n",
"\n",
"first some imports of the usual suspects: RDKit, pandas, matplotlib, numpy and sklearn. New kid on the block is scikit-mol"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "51aa3d62",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:44.604531Z",
"iopub.status.busy": "2025-05-08T16:22:44.604218Z",
"iopub.status.idle": "2025-05-08T16:22:46.163842Z",
"shell.execute_reply": "2025-05-08T16:22:46.162418Z"
}
},
"outputs": [],
"source": [
"import os\n",
"import rdkit\n",
"from rdkit import Chem\n",
"from rdkit.Chem import PandasTools\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from time import time\n",
"import numpy as np\n",
"from sklearn.pipeline import Pipeline, make_pipeline\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.model_selection import train_test_split\n",
"from scikit_mol.fingerprints import MorganFingerprintTransformer\n",
"from scikit_mol.conversions import SmilesToMolTransformer"
]
},
{
"cell_type": "markdown",
"id": "e07990d0",
"metadata": {},
"source": [
"We will need some data. There is a dataset with the SLC6A4 active compounds from ExcapeDB on Zenodo. The scikit-mol project uses a subset of this for testing, and the samples there has been specially selected to give good results in testing (it should therefore be used for any production modelling). If full_set is false, the fast subset will be used, and otherwise the full dataset will be downloaded if needed."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "adbc1868",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.167928Z",
"iopub.status.busy": "2025-05-08T16:22:46.166905Z",
"iopub.status.idle": "2025-05-08T16:22:46.173404Z",
"shell.execute_reply": "2025-05-08T16:22:46.172138Z"
}
},
"outputs": [],
"source": [
"full_set = False\n",
"\n",
"if full_set:\n",
" csv_file = \"SLC6A4_active_excape_export.csv\"\n",
" if not os.path.exists(csv_file):\n",
" import urllib.request\n",
"\n",
" url = \"https://ndownloader.figshare.com/files/25747817\"\n",
" urllib.request.urlretrieve(url, csv_file)\n",
"else:\n",
" csv_file = \"../../tests/data/SLC6A4_active_excapedb_subset.csv\""
]
},
{
"cell_type": "markdown",
"id": "d2ce3c7f",
"metadata": {},
"source": [
"The CSV data is loaded into a Pandas dataframe and the PandasTools utility from RDKit is used to add a column with RDKit molecules"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9a283f12",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.177164Z",
"iopub.status.busy": "2025-05-08T16:22:46.176440Z",
"iopub.status.idle": "2025-05-08T16:22:46.233488Z",
"shell.execute_reply": "2025-05-08T16:22:46.232374Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 out of 200 SMILES failed in conversion\n"
]
}
],
"source": [
"data = pd.read_csv(csv_file)\n",
"\n",
"PandasTools.AddMoleculeColumnToFrame(data, smilesCol=\"SMILES\")\n",
"print(f\"{data.ROMol.isna().sum()} out of {len(data)} SMILES failed in conversion\")"
]
},
{
"cell_type": "markdown",
"id": "e245e989",
"metadata": {},
"source": [
"We use the train_test_split to, well, split the dataframe's molecule columns and pXC50 column into lists for train and testing"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "303b83de",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.236917Z",
"iopub.status.busy": "2025-05-08T16:22:46.236251Z",
"iopub.status.idle": "2025-05-08T16:22:46.243264Z",
"shell.execute_reply": "2025-05-08T16:22:46.242175Z"
},
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"mol_list_train, mol_list_test, y_train, y_test = train_test_split(\n",
" data.ROMol, data.pXC50, random_state=42\n",
")"
]
},
{
"cell_type": "markdown",
"id": "56247c3b",
"metadata": {},
"source": [
"We will standardize the molecules before modelling. This is best done before the hyperparameter optimization of the featurization with the scikit-mol transformer and regression modelling, as the standardization is otherwise done for every loop in the hyperparameter optimization, which will make it take longer time."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "1383d0fc",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.247777Z",
"iopub.status.busy": "2025-05-08T16:22:46.246787Z",
"iopub.status.idle": "2025-05-08T16:22:46.614634Z",
"shell.execute_reply": "2025-05-08T16:22:46.613399Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/anton/projects/scikit-mol/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n"
]
}
],
"source": [
"# Probably the recommended way would be to prestandardize the data if there's no changes to the transformer,\n",
"# and then add the standardizer in the inference pipeline.\n",
"\n",
"from scikit_mol.standardizer import Standardizer\n",
"\n",
"standardizer = Standardizer()\n",
"mol_list_std_train = standardizer.transform(mol_list_train)"
]
},
{
"cell_type": "markdown",
"id": "0775d395",
"metadata": {},
"source": [
"A simple pipeline with a MorganTransformer and a Ridge() regression for demonstration."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "51c74711",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.618057Z",
"iopub.status.busy": "2025-05-08T16:22:46.617207Z",
"iopub.status.idle": "2025-05-08T16:22:46.622371Z",
"shell.execute_reply": "2025-05-08T16:22:46.621207Z"
},
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"moltransformer = MorganFingerprintTransformer()\n",
"regressor = Ridge()\n",
"\n",
"optimization_pipe = make_pipeline(moltransformer, regressor)"
]
},
{
"cell_type": "markdown",
"id": "8221a682",
"metadata": {},
"source": [
"For hyperparameter optimization we import the RandomizedSearchCV class from Scikit-Learn. It will try different random combinations of settings and use internal cross-validation to find the best model. In the end, it will fit the best found parameters on the full set. We also import loguniform, to get a better sampling of some of the parameters."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4c6b833f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.625354Z",
"iopub.status.busy": "2025-05-08T16:22:46.625058Z",
"iopub.status.idle": "2025-05-08T16:22:46.629915Z",
"shell.execute_reply": "2025-05-08T16:22:46.628829Z"
},
"title": "Now hyperparameter tuning"
},
"outputs": [],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"# from sklearn.utils.fixes import loguniform\n",
"from scipy.stats import loguniform"
]
},
{
"cell_type": "markdown",
"id": "6b9d4576",
"metadata": {},
"source": [
"With the pipelines, getting the names of the parameters to tune is a bit more tricky, as they are concatenations of the name of the step and the parameter with double underscores in between. We can get the available parameters from the pipeline with the get_params() method, and select the parameters we want to change from there."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0af1003b",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.633716Z",
"iopub.status.busy": "2025-05-08T16:22:46.632911Z",
"iopub.status.idle": "2025-05-08T16:22:46.641844Z",
"shell.execute_reply": "2025-05-08T16:22:46.640881Z"
},
"title": "Which keys do we have?"
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['memory', 'steps', 'transform_input', 'verbose', 'morganfingerprinttransformer', 'ridge', 'morganfingerprinttransformer__fpSize', 'morganfingerprinttransformer__n_jobs', 'morganfingerprinttransformer__radius', 'morganfingerprinttransformer__safe_inference_mode', 'morganfingerprinttransformer__useBondTypes', 'morganfingerprinttransformer__useChirality', 'morganfingerprinttransformer__useCounts', 'morganfingerprinttransformer__useFeatures', 'ridge__alpha', 'ridge__copy_X', 'ridge__fit_intercept', 'ridge__max_iter', 'ridge__positive', 'ridge__random_state', 'ridge__solver', 'ridge__tol'])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"optimization_pipe.get_params().keys()"
]
},
{
"cell_type": "markdown",
"id": "cb0db6a5",
"metadata": {},
"source": [
"We will tune the regularization strength of the Ridge regressor, and try out different parameters for the Morgan fingerprint, namely the number of bits, the radius of the fingerprint, wheter to use counts or bits and features."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c2d541b3",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.645249Z",
"iopub.status.busy": "2025-05-08T16:22:46.644964Z",
"iopub.status.idle": "2025-05-08T16:22:46.651276Z",
"shell.execute_reply": "2025-05-08T16:22:46.650210Z"
}
},
"outputs": [],
"source": [
"param_dist = {\n",
" \"ridge__alpha\": loguniform(1e-2, 1e3),\n",
" \"morganfingerprinttransformer__fpSize\": [256, 512, 1024, 2048, 4096],\n",
" \"morganfingerprinttransformer__radius\": [1, 2, 3, 4],\n",
" \"morganfingerprinttransformer__useCounts\": [True, False],\n",
" \"morganfingerprinttransformer__useFeatures\": [True, False],\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "2157d154",
"metadata": {
"lines_to_next_cell": 2
},
"source": [
"The report function was taken from [this example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py) from the scikit learn documentation."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f2c91783",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.655212Z",
"iopub.status.busy": "2025-05-08T16:22:46.654913Z",
"iopub.status.idle": "2025-05-08T16:22:46.662196Z",
"shell.execute_reply": "2025-05-08T16:22:46.661226Z"
},
"title": "From https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py"
},
"outputs": [],
"source": [
"# Utility function to report best scores\n",
"def report(results, n_top=3):\n",
" for i in range(1, n_top + 1):\n",
" candidates = np.flatnonzero(results[\"rank_test_score\"] == i)\n",
" for candidate in candidates:\n",
" print(\"Model with rank: {0}\".format(i))\n",
" print(\n",
" \"Mean validation score: {0:.3f} (std: {1:.3f})\".format(\n",
" results[\"mean_test_score\"][candidate],\n",
" results[\"std_test_score\"][candidate],\n",
" )\n",
" )\n",
" print(\"Parameters: {0}\".format(results[\"params\"][candidate]))\n",
" print(\"\")"
]
},
{
"cell_type": "markdown",
"id": "469691f4",
"metadata": {},
"source": [
"We will do 25 tries of random parameter sets, and see what comes out as the best one. If you are using the small example dataset, this should take some second, but may take some minutes with the full set."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "79a70a0f",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:46.665390Z",
"iopub.status.busy": "2025-05-08T16:22:46.665100Z",
"iopub.status.idle": "2025-05-08T16:22:49.120359Z",
"shell.execute_reply": "2025-05-08T16:22:49.119269Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Runtime: 2.45 for 25 iterations)\n"
]
}
],
"source": [
"n_iter_search = 25\n",
"random_search = RandomizedSearchCV(\n",
" optimization_pipe, param_distributions=param_dist, n_iter=n_iter_search, cv=3\n",
")\n",
"t0 = time()\n",
"random_search.fit(mol_list_std_train, y_train.values)\n",
"t1 = time()\n",
"\n",
"print(f\"Runtime: {t1-t0:0.2F} for {n_iter_search} iterations)\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "b6160cb3",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:49.124324Z",
"iopub.status.busy": "2025-05-08T16:22:49.123579Z",
"iopub.status.idle": "2025-05-08T16:22:49.130023Z",
"shell.execute_reply": "2025-05-08T16:22:49.128965Z"
},
"lines_to_next_cell": 0
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model with rank: 1\n",
"Mean validation score: 0.459 (std: 0.117)\n",
"Parameters: {'morganfingerprinttransformer__fpSize': 1024, 'morganfingerprinttransformer__radius': 3, 'morganfingerprinttransformer__useCounts': False, 'morganfingerprinttransformer__useFeatures': True, 'ridge__alpha': np.float64(11.211371939288233)}\n",
"\n",
"Model with rank: 2\n",
"Mean validation score: 0.427 (std: 0.130)\n",
"Parameters: {'morganfingerprinttransformer__fpSize': 512, 'morganfingerprinttransformer__radius': 2, 'morganfingerprinttransformer__useCounts': True, 'morganfingerprinttransformer__useFeatures': False, 'ridge__alpha': np.float64(22.96332964984786)}\n",
"\n",
"Model with rank: 3\n",
"Mean validation score: 0.426 (std: 0.166)\n",
"Parameters: {'morganfingerprinttransformer__fpSize': 4096, 'morganfingerprinttransformer__radius': 2, 'morganfingerprinttransformer__useCounts': True, 'morganfingerprinttransformer__useFeatures': False, 'ridge__alpha': np.float64(23.874114087368742)}\n",
"\n"
]
}
],
"source": [
"report(random_search.cv_results_)"
]
},
{
"cell_type": "markdown",
"id": "9a2ea219",
"metadata": {},
"source": [
"It can be interesting to see what combinations of hyperparameters gave good results for the cross-validation. Usually the number of bits are in the high end and radius is 2 to 4. But this can vary a bit, as we do a small number of tries for this demo. More extended search with more iterations could maybe find even better and more consistent. solutions"
]
},
{
"cell_type": "markdown",
"id": "6cf91582",
"metadata": {},
"source": [
"Let's see if standardization had any influence on this dataset. We build an inference pipeline that includes the standardization object and the best estimator, and run the best estimator directly on the list of test molecules"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "4daaf106",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:49.133255Z",
"iopub.status.busy": "2025-05-08T16:22:49.132589Z",
"iopub.status.idle": "2025-05-08T16:22:49.294436Z",
"shell.execute_reply": "2025-05-08T16:22:49.293304Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No Standardization 0.4921\n",
"With Standardization 0.4921\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/anton/projects/scikit-mol/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n",
"/home/anton/projects/scikit-mol/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n"
]
}
],
"source": [
"inference_pipe = make_pipeline(standardizer, random_search.best_estimator_)\n",
"\n",
"print(\n",
" f\"No Standardization {random_search.best_estimator_.score(mol_list_test, y_test):0.4F}\"\n",
")\n",
"print(f\"With Standardization {inference_pipe.score(mol_list_test, y_test):0.4F}\")"
]
},
{
"cell_type": "markdown",
"id": "2d31c059",
"metadata": {
"lines_to_next_cell": 0,
"title": "Building an inference pipeline, it appears our test-data was pretty standard"
},
"source": [
"We see that the molecules in the dataset already appear to be in forms similar to those produced by the standardization.\n",
"\n",
"Interestingly, the test-set performance often seems to be better than the CV performance during the hyperparameter search. This may be because the model is refit on the whole training dataset at the end of the search, as the `refit` parameter of the `random_search` object is `True` by default. The final model is thus fitted on more data than the individual models during cross-validation.\n",
"\n",
"To demonstrate the effect of standardization, we can challenge the predictor with different forms of benzoic acid and benzoates."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "92105568",
"metadata": {
"execution": {
"iopub.execute_input": "2025-05-08T16:22:49.297625Z",
"iopub.status.busy": "2025-05-08T16:22:49.297285Z",
"iopub.status.idle": "2025-05-08T16:22:49.318086Z",
"shell.execute_reply": "2025-05-08T16:22:49.316957Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predictions with no standardization: [6.36710496 6.49711427 6.49711427 6.28330625 6.72697401]\n",
"Predictions with standardization: [6.36710496 6.36710496 6.36710496 6.36710496 6.36710496]\n"
]
}
],
"source": [
"# Integrate the Standardizer and challenge it with some different forms and salts of benzoic acid\n",
"smiles_list = [\n",
"    \"c1ccccc1C(=O)[OH]\",\n",
"    \"c1ccccc1C(=O)[O-]\",\n",
"    \"c1ccccc1C(=O)[O-].[Na+]\",\n",
"    \"c1ccccc1C(=O)[O][Na]\",\n",
"    \"c1ccccc1C(=O)[O-].C[N+](C)C\",\n",
"]\n",
"mols_list = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]\n",
"\n",
"print(\n",
"    f\"Predictions with no standardization: {random_search.best_estimator_.predict(mols_list)}\"\n",
")\n",
"print(f\"Predictions with standardization: {inference_pipe.predict(mols_list)}\")"
]
},
{
"cell_type": "markdown",
"id": "9d196197",
"metadata": {},
"source": [
"Without standardization we get variation in the predictions, but with the standardization object in place, we get the same result for every form. If you want a model that gives different predictions for the different forms, either the standardization needs to be removed or its settings changed.\n",
"\n",
"From here it should be easy to save the model using pickle, so that it can be loaded and used in other Python projects. The pipeline carries the standardization, the featurization and the prediction in one easy-to-reuse object. If you want the model to be able to predict directly from SMILES strings, check out the `SmilesToMolTransformer` class, which is also available in Scikit-Mol :-)\n"
]
},
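{
"cell_type": "markdown",
"id": "sketch-pickle-md",
"metadata": {},
"source": [
"As a minimal sketch of the reuse described above (an illustration, not part of the original analysis), the next cell prepends a `SmilesToMolTransformer` so that the pipeline accepts SMILES strings, and then round-trips the result through `pickle`. The file name `inference_pipe.pkl` and the names `smiles_pipe` and `loaded_pipe` are placeholders chosen for this example, and the cell assumes that `inference_pipe`, `make_pipeline` and `smiles_list` from the cells above are still in memory. Only unpickle files from sources you trust."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-pickle-code",
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"from scikit_mol.conversions import SmilesToMolTransformer\n",
"\n",
"# Prepend SMILES parsing so the pipeline can be fed plain strings\n",
"smiles_pipe = make_pipeline(SmilesToMolTransformer(), inference_pipe)\n",
"\n",
"# \"inference_pipe.pkl\" is a hypothetical file name used for illustration\n",
"with open(\"inference_pipe.pkl\", \"wb\") as handle:\n",
"    pickle.dump(smiles_pipe, handle)\n",
"\n",
"# Load the pipeline again, as another project would do\n",
"with open(\"inference_pipe.pkl\", \"rb\") as handle:\n",
"    loaded_pipe = pickle.load(handle)\n",
"\n",
"print(f\"Predictions from SMILES: {loaded_pipe.predict(smiles_list)}\")"
]
},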
{
"cell_type": "markdown",
"id": "824ebc99",
"metadata": {},
"source": []
}
],
"metadata": {
"jupytext": {
"formats": "docs//notebooks//ipynb,docs//notebooks//scripts//py:percent"
},
"kernelspec": {
"display_name": "Python 3.9.4 ('rdkit')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/notebooks/07_parallel_transforms.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "6f68fb8e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "87ed8373",
"metadata": {},
"source": [
"# Parallel calculations of transforms\n",
"\n",
"Scikit-mol supports parallel calculations of fingerprints and descriptors. This feature can be activated and configured using the `n_jobs` parameter or the `.n_jobs` attribute after object instantiation.\n",
"\n",
"To begin, let's import the necessary libraries: RDKit and pandas. And of course, we'll also need to import scikit-mol, which is the new kid on the block."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "dac6956a",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:38.302600Z",
"iopub.status.busy": "2024-11-24T09:27:38.302116Z",
"iopub.status.idle": "2024-11-24T09:27:39.171522Z",
"shell.execute_reply": "2024-11-24T09:27:39.170882Z"
}
},
"outputs": [],
"source": [
"import pathlib\n",
"import time\n",
"\n",
"import pandas as pd\n",
"from rdkit.Chem import PandasTools\n",
"\n",
"from scikit_mol.conversions import SmilesToMolTransformer\n",
"from scikit_mol.descriptors import MolecularDescriptorTransformer\n",
"from scikit_mol.fingerprints import MorganFingerprintTransformer"
]
},
{
"cell_type": "markdown",
"id": "7c2a81f2",
"metadata": {},
"source": [
"## Obtaining the Data\n",
"\n",
"We'll need some data to work with, so we'll use a dataset of SLC6A4 active compounds from ExcapeDB that is available on Zenodo. Scikit-mol uses a subset of this dataset for testing purposes, and the samples have been specially selected to provide good results in testing. Note: This dataset should never be used for production modeling.\n",
"\n",
"In the code below, you can set full_set to True to download the full dataset. Otherwise, the smaller dataset will be used."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f64c418f",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:39.174368Z",
"iopub.status.busy": "2024-11-24T09:27:39.174075Z",
"iopub.status.idle": "2024-11-24T09:27:39.177863Z",
"shell.execute_reply": "2024-11-24T09:27:39.177305Z"
}
},
"outputs": [],
"source": [
"full_set = False\n",
"\n",
"if full_set:\n",
"    csv_file = \"SLC6A4_active_excape_export.csv\"\n",
"    if not pathlib.Path(csv_file).exists():\n",
"        import urllib.request\n",
"\n",
"        url = \"https://ndownloader.figshare.com/files/25747817\"\n",
"        urllib.request.urlretrieve(url, csv_file)\n",
"else:\n",
"    csv_file = \"../tests/data/SLC6A4_active_excapedb_subset.csv\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0eabd800",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:39.180191Z",
"iopub.status.busy": "2024-11-24T09:27:39.179937Z",
"iopub.status.idle": "2024-11-24T09:27:39.221096Z",
"shell.execute_reply": "2024-11-24T09:27:39.220386Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 out of 200 SMILES failed in conversion\n"
]
}
],
"source": [
"data = pd.read_csv(csv_file)\n",
"\n",
"PandasTools.AddMoleculeColumnToFrame(data, smilesCol=\"SMILES\")\n",
"print(f\"{data.ROMol.isna().sum()} out of {len(data)} SMILES failed in conversion\")"
]
},
{
"cell_type": "markdown",
"id": "4144946e",
"metadata": {},
"source": [
"## Evaluating the Impact of Parallelism on Transformations\n",
"\n",
"Let's start by creating a baseline for our calculations without using parallelism."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a7f66af7",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:39.223702Z",
"iopub.status.busy": "2024-11-24T09:27:39.223459Z",
"iopub.status.idle": "2024-11-24T09:27:39.228461Z",
"shell.execute_reply": "2024-11-24T09:27:39.227977Z"
},
"title": "A demonstration of the speedup that can be had for the descriptor transformer"
},
"outputs": [],
"source": [
"transformer = MolecularDescriptorTransformer(n_jobs=1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a03bc824",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:39.230911Z",
"iopub.status.busy": "2024-11-24T09:27:39.230692Z",
"iopub.status.idle": "2024-11-24T09:27:41.368180Z",
"shell.execute_reply": "2024-11-24T09:27:41.367438Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/anton/scikit-mol/.venv/lib/python3.9/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculation time on dataset of size 200 with parallel=1:\t2.10 seconds\n"
]
}
],
"source": [
"def test_transformer(transformer):\n",
"    t0 = time.time()\n",
"    transformer.transform(data.ROMol)\n",
"    t = time.time() - t0\n",
"    print(\n",
"        f\"Calculation time on dataset of size {len(data)} with n_jobs={transformer.n_jobs}:\\t{t:0.2F} seconds\"\n",
"    )\n",
"\n",
"\n",
"test_transformer(transformer)"
]
},
{
"cell_type": "markdown",
"id": "d304d675",
"metadata": {},
"source": [
"\n",
"Let's see if parallelism can help us speed up our transformations."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c80388e6",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:41.370886Z",
"iopub.status.busy": "2024-11-24T09:27:41.370638Z",
"iopub.status.idle": "2024-11-24T09:27:42.384085Z",
"shell.execute_reply": "2024-11-24T09:27:42.383188Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/anton/scikit-mol/.venv/lib/python3.9/site-packages/numpy/_core/fromnumeric.py:57: FutureWarning: 'Series.swapaxes' is deprecated and will be removed in a future version. Please use 'Series.transpose' instead.\n",
" return bound(*args, **kwds)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculation time on dataset of size 200 with parallel=2:\t2.19 seconds\n"
]
}
],
"source": [
"transformer = MolecularDescriptorTransformer(n_jobs=2)\n",
"test_transformer(transformer)"
]
},
{
"cell_type": "markdown",
"id": "731bd13a",
"metadata": {},
"source": [
"We've seen that parallelism can help speed up our transformations, with the degree of speedup depending on the number of CPU cores available. However, it's worth noting that there may be some overhead associated with the process of splitting the dataset, pickling objects and functions, and passing them to the parallel child processes. As a result, it may not always be worthwhile to use parallelism, particularly for smaller datasets or certain types of fingerprints.\n",
"\n",
"It's also worth noting that there are different methods for creating the child processes, with the default method on Linux being 'fork', while on Mac and Windows it's 'spawn'. The code we're using has been tested on Linux using the 'fork' method.\n",
"\n",
"Now, let's see how parallelism impacts another type of transformer."
]
},
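{
"cell_type": "markdown",
"id": "sketch-start-method-md",
"metadata": {},
"source": [
"Before running that comparison, a short aside: the standard library can report which start methods are available and which one Python's `multiprocessing` module uses by default on your platform. This is only a sketch for inspecting your environment. It reports Python's own default, which depends on the platform and Python version, and it assumes nothing about how scikit-mol or joblib create their workers, since joblib backends manage their worker processes themselves."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-start-method-code",
"metadata": {},
"outputs": [],
"source": [
"import multiprocessing\n",
"\n",
"# Start methods supported on this platform, e.g. fork, spawn, forkserver\n",
"print(\"Available start methods:\", multiprocessing.get_all_start_methods())\n",
"\n",
"# Note: calling get_start_method() without arguments fixes the default\n",
"# for this session, so a later set_start_method() call would need force=True\n",
"print(\"Default start method:\", multiprocessing.get_start_method())"
]
},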
{
"cell_type": "code",
"execution_count": 11,
"id": "ef6d2b0c",
"metadata": {
"execution": {
"iopub.execute_input": "2024-11-24T09:27:42.387160Z",
"iopub.status.busy": "2024-11-24T09:27:42.386886Z",
"iopub.status.idle": "2024-11-24T09:27:42.484867Z",
"shell.execute_reply": "2024-11-24T09:27:42.484222Z"
},
"lines_to_next_cell": 2,
"title": "Some of the benchmarking plots"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculation time on dataset of size 200 with parallel=1:\t0.03 seconds\n",
"Calculation time on dataset of size 200 with parallel=2:\t0.08 seconds\n"
]
}
],
"source": [
"transformer = MorganFingerprintTransformer(n_jobs=1)\n",
"test_transformer(transformer)\n",
"transformer.n_jobs = 2\n",
"test_transformer(transformer)"
]
},
{
"cell_type": "markdown",
"id": "2aac85b1",
"metadata": {},
"source": [
"Interestingly, we observed that parallelism actually took longer to calculate the fingerprints in some cases, which is a perfect illustration of the overhead issue associated with parallelism. Generally, the faster the fingerprint calculation in itself, the larger the dataset needs to be for parallelism to be worthwhile. For example, the Descriptor transformer, which is one of the slowest, can benefit even for smaller datasets, while for faster fingerprint types like Morgan, Atompairs, and Topological Torsion fingerprints, the dataset needs to be larger.\n",
"\n",
"We've also included a series of plots below, showing the speedup over serial execution for different numbers of cores and different dataset sizes. These timings were taken on a 16-core machine (32 hyperthreads). Only the largest datasets (>10,000 samples) make it worthwhile to parallelize Morgan, Atompairs, and Topological Torsions. SECfingerprint, MACCS keys, and RDKitFP are intermediate and benefit from parallelism when the dataset is larger, say >500 samples. Descriptors, on the other hand, benefit almost immediately, even for the smallest datasets (>100 samples)."
]
},
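{
"cell_type": "markdown",
"id": "sketch-n-jobs-md",
"metadata": {},
"source": [
"The rules of thumb above can be sketched as a small helper. `suggest_n_jobs` and `PARALLEL_THRESHOLDS` are hypothetical names that are not part of scikit-mol, and the thresholds are simply the approximate dataset sizes quoted above for one 16-core machine, so measure on your own hardware before relying on them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-n-jobs-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper, not part of scikit-mol. The thresholds are the rough\n",
"# dataset sizes quoted in the text above, not measurements for your machine.\n",
"PARALLEL_THRESHOLDS = {\n",
"    \"descriptors\": 100,\n",
"    \"secfp\": 500,\n",
"    \"maccs\": 500,\n",
"    \"rdkitfp\": 500,\n",
"    \"morgan\": 10_000,\n",
"    \"atompair\": 10_000,\n",
"    \"topologicaltorsion\": 10_000,\n",
"}\n",
"\n",
"\n",
"def suggest_n_jobs(kind, n_samples, max_jobs=4):\n",
"    \"\"\"Return max_jobs when n_samples exceeds the threshold for kind, else 1.\"\"\"\n",
"    return max_jobs if n_samples > PARALLEL_THRESHOLDS[kind] else 1\n",
"\n",
"\n",
"print(suggest_n_jobs(\"morgan\", 200))  # 1: too small to offset the overhead\n",
"print(suggest_n_jobs(\"descriptors\", 200))  # 4: descriptors are slow enough"
]
},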
{
"cell_type": "markdown",
"id": "0913a9f4",
"metadata": {},
"source": [
"## Performance heatmaps\n",
"\n",
"Multiprocessing performance is highly dependent on CPU performance, type of the function and the size of the dataset. To help users understand the performance of their system, we have created a series of heatmaps showing the speedup of different transformers for different dataset sizes and number of cores. The heatmaps are based on the same data as the plots above.\n",
"If you want to test the performance of your system, you can run the code below."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0b353728",
"metadata": {},
"outputs": [],
"source": [
"from rdkit import Chem\n",
"\n",
"from scikit_mol.fingerprints import (\n",
"    AtomPairFingerprintTransformer,\n",
"    AvalonFingerprintTransformer,\n",
"    MHFingerprintTransformer,\n",
"    MorganFingerprintTransformer,\n",
"    RDKitFingerprintTransformer,\n",
"    TopologicalTorsionFingerprintTransformer,\n",
")\n",
"from scikit_mol.plotting import ParallelTester, plot_heatmap\n",
"from scikit_mol.standardizer import Standardizer\n",
"\n",
"mols = [Chem.MolFromSmiles(\"CCCCCCCCBr\")] * 100000\n",
"transformers = [\n",
"    Standardizer(),\n",
"    MorganFingerprintTransformer(),\n",
"    MolecularDescriptorTransformer(),\n",
"    MHFingerprintTransformer(),\n",
"    AtomPairFingerprintTransformer(),\n",
"    AvalonFingerprintTransformer(),\n",
"    RDKitFingerprintTransformer(),\n",
"    TopologicalTorsionFingerprintTransformer(),\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "47991a4f",
"metadata": {},
"source": [
"`ParallelTester` accepts the following parameters:\n",
"- `transformer` - the transformer to test\n",
"- `mols` - the dataset to test\n",
"- `n_mols` - the number of molecules to test on (the largest number should be less than or equal to the number of molecules in `mols`)\n",
"- `n_cores` - the number of cores to test on (the largest number should be less than or equal to the number of cores in your system)\n",
"- `backend` - the backend to use for multiprocessing (default is `loky`, see [joblib documentation](https://joblib.readthedocs.io/en/latest/parallel.html) for more options)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "fa00f0eb",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>10</th>\n",
" <th>100</th>\n",
" <th>100</th>\n",
" <th>1000</th>\n",
" <th>10000</th>\n",
" <th>100000</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.002394</td>\n",
" <td>0.003997</td>\n",
" <td>0.003997</td>\n",
" <td>0.031515</td>\n",
" <td>0.479826</td>\n",
" <td>3.228737</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.012707</td>\n",
" <td>1.278691</td>\n",
" <td>1.278691</td>\n",
" <td>1.246285</td>\n",
" <td>1.507089</td>\n",
" <td>3.614131</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1.238397</td>\n",
" <td>1.330802</td>\n",
" <td>1.330802</td>\n",
" <td>1.20337</td>\n",
" <td>1.389253</td>\n",
" <td>2.891467</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1.657188</td>\n",
" <td>1.669887</td>\n",
" <td>1.669887</td>\n",
" <td>1.601371</td>\n",
" <td>1.635777</td>\n",
" <td>2.541485</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 10 100 100 1000 10000 100000\n",
"1 0.002394 0.003997 0.003997 0.031515 0.479826 3.228737\n",
"2 0.012707 1.278691 1.278691 1.246285 1.507089 3.614131\n",
"4 1.238397 1.330802 1.330802 1.20337 1.389253 2.891467\n",
"8 1.657188 1.669887 1.669887 1.601371 1.635777 2.541485"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformer = MorganFingerprintTransformer()\n",
"df = ParallelTester(transformer, mols).test()\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "71725395",
"metadata": {},
"source": [
"The resulting DataFrame has one row for each `n_jobs` value and one column for each `n_mols` value. The results can be plotted as a heatmap using the `plot_heatmap` function."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "6e352da5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: title={'center': 'Morgan Fingerprint\\nMax single-threaded speed 31731 mols/s\\nCPU: AMD Ryzen 9 4900HS with Radeon Graphics'}, xlabel='Number of mols', ylabel='Number of jobs'>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "<base64-encoded PNG of the heatmap omitted; the image data is truncated in this preview. Title: 'Morgan Fingerprint', x-axis: 'Number of mols', y-axis: 'Number of jobs'>"
gitextract_o126_axo/
├── .github/
│   ├── CODEOWNERS
│   └── workflows/
│       ├── code_quality.yaml
│       ├── publish.yaml
│       ├── pytest.yaml
│       └── welcome.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .readthedocs.yaml
├── .vscode/
│   ├── extensions.json
│   └── settings.json
├── CITATION.bib
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.md
├── docs/
│   ├── api/
│   │   ├── fingerprints.base.md
│   │   ├── scikit_mol.applicability.md
│   │   ├── scikit_mol.conversions.md
│   │   ├── scikit_mol.core.md
│   │   ├── scikit_mol.descriptors.md
│   │   ├── scikit_mol.fingerprints.md
│   │   ├── scikit_mol.parallel.md
│   │   ├── scikit_mol.plotting.md
│   │   ├── scikit_mol.safeinference.md
│   │   └── scikit_mol.standardizer.md
│   ├── assets/
│   │   ├── css/
│   │   │   └── tweak-width.css
│   │   └── js/
│   │       └── readthedocs.js
│   ├── contributing.md
│   ├── index.md
│   ├── notebooks/
│   │   ├── 01_basic_usage.ipynb
│   │   ├── 02_descriptor_transformer.ipynb
│   │   ├── 03_example_pipeline.ipynb
│   │   ├── 04_standardizer.ipynb
│   │   ├── 05_smiles_sanitization.ipynb
│   │   ├── 06_hyperparameter_tuning.ipynb
│   │   ├── 07_parallel_transforms.ipynb
│   │   ├── 08_external_library_skopt.ipynb
│   │   ├── 09_Combinatorial_Method_Usage_with_FingerPrint_Transformers.ipynb
│   │   ├── 10_pipeline_pandas_output.ipynb
│   │   ├── 11_safe_inference.ipynb
│   │   ├── 12_custom_fingerprint_transformer.ipynb
│   │   ├── 13_applicability_domain.ipynb
│   │   ├── README.md
│   │   ├── pair_notebook.sh
│   │   ├── run_notebooks.sh
│   │   ├── scripts/
│   │   │   ├── 01_basic_usage.py
│   │   │   ├── 02_descriptor_transformer.py
│   │   │   ├── 03_example_pipeline.py
│   │   │   ├── 04_standardizer.py
│   │   │   ├── 05_smiles_sanitization.py
│   │   │   ├── 06_hyperparameter_tuning.py
│   │   │   ├── 07_parallel_transforms.py
│   │   │   ├── 08_external_library_skopt.py
│   │   │   ├── 09_Combinatorial_Method_Usage_with_FingerPrint_Transformers.py
│   │   │   ├── 10_pipeline_pandas_output.py
│   │   │   ├── 11_safe_inference.py
│   │   │   ├── 12_custom_fingerprint_transformer.py
│   │   │   └── 13_applicability_domain.py
│   │   └── sync_notebooks.sh
│   └── overrides/
│       └── main.html
├── mkdocs.yml
├── pyproject.toml
├── resources/
│   └── logo/
│       ├── ScikitMol_Logo.ai
│       └── ScikitMol_Logo_Hybrid.ai
├── ruff.toml
├── scikit_mol/
│   ├── __init__.py
│   ├── _constants.py
│   ├── applicability/
│   │   ├── LICENSE.MIT
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── bounding_box.py
│   │   ├── convex_hull.py
│   │   ├── hotelling.py
│   │   ├── isolation_forest.py
│   │   ├── kernel_density.py
│   │   ├── knn.py
│   │   ├── leverage.py
│   │   ├── local_outlier.py
│   │   ├── mahalanobis.py
│   │   ├── standardization.py
│   │   └── topkat.py
│   ├── conversions.py
│   ├── core.py
│   ├── descriptors.py
│   ├── fingerprints/
│   │   ├── __init__.py
│   │   ├── atompair.py
│   │   ├── avalon.py
│   │   ├── baseclasses.py
│   │   ├── maccs.py
│   │   ├── minhash.py
│   │   ├── morgan.py
│   │   ├── rdkitfp.py
│   │   └── topologicaltorsion.py
│   ├── parallel.py
│   ├── plotting.py
│   ├── safeinference.py
│   ├── standardizer.py
│   └── utilities.py
├── setup.cfg
├── tests/
│   ├── __init__.py
│   ├── applicability/
│   │   ├── __init__.py
│   │   ├── conftest.py
│   │   ├── test_base.py
│   │   ├── test_bounding_box.py
│   │   ├── test_convex_hull.py
│   │   ├── test_hotelling.py
│   │   ├── test_isolation_forest.py
│   │   ├── test_kernel_density.py
│   │   ├── test_knn.py
│   │   ├── test_leverage.py
│   │   ├── test_local_outlier.py
│   │   ├── test_mahalanobis.py
│   │   ├── test_standardization.py
│   │   └── test_topkat.py
│   ├── conftest.py
│   ├── fixtures.py
│   ├── test_desctransformer.py
│   ├── test_fptransformers.py
│   ├── test_fptransformersgenerator.py
│   ├── test_parameter_types.py
│   ├── test_safeinferencemode.py
│   ├── test_sanitizer.py
│   ├── test_scikit_mol.py
│   ├── test_smilestomol.py
│   └── test_transformers.py
└── uv.toml
SYMBOL INDEX (377 symbols across 58 files)
FILE: docs/notebooks/scripts/06_hyperparameter_tuning.py
function report (line 125) | def report(results, n_top=3):
FILE: docs/notebooks/scripts/07_parallel_transforms.py
function test_transformer (line 72) | def test_transformer(transformer):
FILE: docs/notebooks/scripts/08_external_library_skopt.py
function objective (line 80) | def objective(**params):
FILE: docs/notebooks/scripts/10_pipeline_pandas_output.py
function compute_metrics (line 132) | def compute_metrics(y_true, y_pred):
function combine_datasets (line 201) | def combine_datasets(data, cddd):
FILE: docs/notebooks/scripts/12_custom_fingerprint_transformer.py
class DummyFingerprintTransformer (line 36) | class DummyFingerprintTransformer(BaseFpsTransformer):
method __init__ (line 37) | def __init__(self, fpSize=64, n_jobs=1, safe_inference_mode=False):
method _transform_mol (line 43) | def _transform_mol(self, mol):
class UnpickableFingerprintTransformer (line 63) | class UnpickableFingerprintTransformer(BaseFpsTransformer):
method __init__ (line 64) | def __init__(self, fpSize=1024, n_jobs=1, safe_inference_mode=False, *...
method _transform_mol (line 73) | def _transform_mol(self, mol):
class BadTransformer (line 85) | class BadTransformer(BaseFpsTransformer):
method __init__ (line 86) | def __init__(self, generator, n_jobs=1):
method _transform_mol (line 90) | def _transform_mol(self, mol):
class NamedTansformer1 (line 117) | class NamedTansformer1(UnpickableFingerprintTransformer):
class NamedTansformer2 (line 121) | class NamedTansformer2(UnpickableFingerprintTransformer):
method __init__ (line 122) | def __init__(self):
class FancyFingerprintTransformer (line 126) | class FancyFingerprintTransformer(UnpickableFingerprintTransformer):
FILE: docs/notebooks/scripts/13_applicability_domain.py
function check_drug_applicability (line 220) | def check_drug_applicability(smiles, name):
FILE: scikit_mol/applicability/base.py
class _ADOutputMixin (line 15) | class _ADOutputMixin(_SetOutputMixin):
method __init_subclass__ (line 18) | def __init_subclass__(cls, **kwargs):
function _safe_flatten (line 30) | def _safe_flatten(X: Union[ArrayLike, pd.DataFrame]) -> NDArray[np.float...
class BaseApplicabilityDomain (line 48) | class BaseApplicabilityDomain(BaseEstimator, TransformerMixin, _ADOutput...
method __init__ (line 82) | def __init__(
method fit (line 108) | def fit(self, X: ArrayLike, y: Optional[Any] = None) -> "BaseApplicabi...
method fit_threshold (line 125) | def fit_threshold(
method transform (line 161) | def transform(
method _transform (line 188) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
method predict (line 203) | def predict(
method score_transform (line 225) | def score_transform(
method get_feature_names_out (line 253) | def get_feature_names_out(self, input_features=None) -> NDArray[np.str_]:
FILE: scikit_mol/applicability/bounding_box.py
class BoundingBoxApplicabilityDomain (line 18) | class BoundingBoxApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 73) | def __init__(
method fit (line 93) | def fit(
method _transform (line 122) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/convex_hull.py
class ConvexHullApplicabilityDomain (line 19) | class ConvexHullApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 56) | def __init__(
method fit (line 62) | def fit(
method _transform (line 87) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/hotelling.py
class HotellingT2ApplicabilityDomain (line 20) | class HotellingT2ApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 61) | def __init__(
method _set_statistical_threshold (line 72) | def _set_statistical_threshold(self, X: NDArray) -> None:
method fit (line 87) | def fit(
method _transform (line 118) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/isolation_forest.py
class IsolationForestApplicabilityDomain (line 20) | class IsolationForestApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 63) | def __init__(
method fit (line 78) | def fit(
method _transform (line 114) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/kernel_density.py
class KernelDensityApplicabilityDomain (line 20) | class KernelDensityApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 64) | def __init__(
method fit (line 75) | def fit(
method _transform (line 104) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/knn.py
class KNNApplicabilityDomain (line 20) | class KNNApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 94) | def __init__(
method distance_metric (line 110) | def distance_metric(self) -> Union[Callable, str]:
method distance_metric (line 114) | def distance_metric(self, value: Union[str, Callable]) -> None:
method fit (line 122) | def fit(self, X: ArrayLike, y=None) -> "KNNApplicabilityDomain":
method _transform (line 160) | def _transform(self, X: np.ndarray) -> np.ndarray:
FILE: scikit_mol/applicability/leverage.py
class LeverageApplicabilityDomain (line 19) | class LeverageApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 81) | def __init__(
method _set_statistical_threshold (line 90) | def _set_statistical_threshold(self, X: NDArray) -> None:
method fit (line 95) | def fit(
method _transform (line 123) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/local_outlier.py
class LocalOutlierFactorApplicabilityDomain (line 20) | class LocalOutlierFactorApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 63) | def __init__(
method fit (line 76) | def fit(
method _transform (line 109) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
method _set_statistical_threshold (line 127) | def _set_statistical_threshold(self, X):
FILE: scikit_mol/applicability/mahalanobis.py
class MahalanobisApplicabilityDomain (line 15) | class MahalanobisApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 50) | def __init__(
method _set_statistical_threshold (line 57) | def _set_statistical_threshold(self, X: NDArray) -> None:
method fit (line 67) | def fit(
method _transform (line 113) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/standardization.py
class StandardizationApplicabilityDomain (line 21) | class StandardizationApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 54) | def __init__(
method _set_statistical_threshold (line 61) | def _set_statistical_threshold(self, X: NDArray) -> None:
method fit (line 69) | def fit(
method _transform (line 98) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/applicability/topkat.py
class TopkatApplicabilityDomain (line 20) | class TopkatApplicabilityDomain(BaseApplicabilityDomain):
method __init__ (line 62) | def __init__(
method fit (line 69) | def fit(self, X: ArrayLike, y: Optional[Any] = None) -> "TopkatApplica...
method _transform (line 113) | def _transform(self, X: NDArray) -> NDArray[np.float64]:
FILE: scikit_mol/conversions.py
class SmilesToMolTransformer (line 22) | class SmilesToMolTransformer(TransformerMixin, NoFitNeededMixin, BaseEst...
method __init__ (line 36) | def __init__(
method get_feature_names_out (line 53) | def get_feature_names_out(self, input_features=None):
method fit (line 56) | def fit(self, X=None, y=None):
method transform (line 60) | def transform(
method _transform (line 85) | def _transform(self, X):
method inverse_transform (line 110) | def inverse_transform(self, X_mols_list, y=None):
FILE: scikit_mol/core.py
class NoFitNeededMixin (line 19) | class NoFitNeededMixin:
method __sklearn_is_fitted__ (line 24) | def __sklearn_is_fitted__(self):
class InvalidMol (line 29) | class InvalidMol:
method __bool__ (line 44) | def __bool__(self):
method __repr__ (line 47) | def __repr__(self):
function _validate_transform_input (line 51) | def _validate_transform_input(X):
function check_transform_input (line 71) | def check_transform_input(method):
function feature_names_default_mol (line 88) | def feature_names_default_mol(method):
FILE: scikit_mol/descriptors.py
class MolecularDescriptorTransformer (line 15) | class MolecularDescriptorTransformer(TransformerMixin, NoFitNeededMixin,...
method __init__ (line 42) | def __init__(
method _get_desc_calculator (line 54) | def _get_desc_calculator(self) -> MolecularDescriptorCalculator:
method desc_list (line 67) | def desc_list(self):
method get_feature_names_out (line 71) | def get_feature_names_out(self, input_features=None):
method desc_list (line 75) | def desc_list(self, desc_list):
method available_descriptors (line 80) | def available_descriptors(self) -> List[str]:
method selected_descriptors (line 85) | def selected_descriptors(self) -> List[str]:
method start_method (line 90) | def start_method(self):
method start_method (line 94) | def start_method(self, start_method):
method _transform_mol (line 103) | def _transform_mol(self, mol: Mol) -> Union[np.ndarray, np.ma.MaskedAr...
method fit (line 117) | def fit(self, x, y=None):
method _transform (line 122) | def _transform(self, x: List[Mol]) -> Union[np.ndarray, np.ma.MaskedAr...
method transform (line 132) | def transform(self, x: List[Mol], y=None) -> Union[np.ndarray, np.ma.M...
function parallel_helper (line 158) | def parallel_helper(params, mols):
FILE: scikit_mol/fingerprints/atompair.py
class AtomPairFingerprintTransformer (line 9) | class AtomPairFingerprintTransformer(FpsGeneratorTransformer):
method __init__ (line 24) | def __init__(
method _generate_fp_generator (line 86) | def _generate_fp_generator(self):
method _transform_mol (line 95) | def _transform_mol(self, mol) -> np.array:
FILE: scikit_mol/fingerprints/avalon.py
class AvalonFingerprintTransformer (line 9) | class AvalonFingerprintTransformer(FpsTransformer):
method __init__ (line 12) | def __init__(
method _mol2fp (line 53) | def _mol2fp(self, mol):
FILE: scikit_mol/fingerprints/baseclasses.py
class BaseFpsTransformer (line 24) | class BaseFpsTransformer(TransformerMixin, NoFitNeededMixin, ABC, BaseEs...
method __init__ (line 39) | def __init__(
method nBits (line 51) | def nBits(self):
method nBits (line 61) | def nBits(self, nBits):
method _get_column_prefix (line 70) | def _get_column_prefix(self) -> str:
method _get_n_digits_column_suffix (line 81) | def _get_n_digits_column_suffix(self) -> int:
method get_display_feature_names_out (line 84) | def get_display_feature_names_out(self, input_features=None):
method get_feature_names_out (line 97) | def get_feature_names_out(self, input_features=None):
method _safe_transform_mol (line 106) | def _safe_transform_mol(self, mol):
method _transform_mol (line 121) | def _transform_mol(self, mol):
method fit (line 125) | def fit(self, X, y=None):
method _transform (line 133) | def _transform(self, X):
method _transform_sparse (line 141) | def _transform_sparse(self, X):
method transform (line 148) | def transform(self, X, y=None):
class FpsTransformer (line 176) | class FpsTransformer(BaseFpsTransformer):
method __init__ (line 179) | def __init__(
method _transform_mol (line 188) | def _transform_mol(self, mol):
method _mol2fp (line 194) | def _mol2fp(self, mol):
method _fp2array (line 201) | def _fp2array(self, fp):
method _transform (line 211) | def _transform(self, X):
method _get_param_names (line 222) | def _get_param_names(self):
class FpsGeneratorTransformer (line 229) | class FpsGeneratorTransformer(BaseFpsTransformer):
method __getstate__ (line 234) | def __getstate__(self):
method __setstate__ (line 242) | def __setstate__(self, state):
method __setattr__ (line 260) | def __setattr__(self, name: str, value):
method _generate_fp_generator (line 269) | def _generate_fp_generator(self):
method _transform_mol (line 273) | def _transform_mol(self, mol) -> np.array:
method dtype (line 282) | def dtype(self):
method dtype (line 292) | def dtype(self, dtype):
method _get_param_names (line 302) | def _get_param_names(self):
function parallel_helper (line 309) | def parallel_helper(X_mols, cls: Type[BaseFpsTransformer], parameters: d...
FILE: scikit_mol/fingerprints/maccs.py
class MACCSKeysFingerprintTransformer (line 9) | class MACCSKeysFingerprintTransformer(FpsTransformer):
method __init__ (line 12) | def __init__(
method fpSize (line 49) | def fpSize(self):
method fpSize (line 53) | def fpSize(self, fpSize):
method _mol2fp (line 60) | def _mol2fp(self, mol):
FILE: scikit_mol/fingerprints/minhash.py
class MHFingerprintTransformer (line 11) | class MHFingerprintTransformer(FpsTransformer):
method __init__ (line 14) | def __init__(
method __getstate__ (line 66) | def __getstate__(self):
method __setstate__ (line 73) | def __setstate__(self, state):
method _mol2fp (line 79) | def _mol2fp(self, mol):
method _fp2array (line 85) | def _fp2array(self, fp):
method _recreate_encoder (line 88) | def _recreate_encoder(self):
method seed (line 94) | def seed(self):
method seed (line 98) | def seed(self, seed):
method n_permutations (line 104) | def n_permutations(self):
method n_permutations (line 112) | def n_permutations(self, n_permutations):
class SECFingerprintTransformer (line 123) | class SECFingerprintTransformer(FpsTransformer):
method __init__ (line 125) | def __init__(
method __getstate__ (line 166) | def __getstate__(self):
method __setstate__ (line 173) | def __setstate__(self, state):
method _mol2fp (line 179) | def _mol2fp(self, mol):
method _recreate_encoder (line 190) | def _recreate_encoder(self):
method seed (line 196) | def seed(self):
method seed (line 200) | def seed(self, seed):
method n_permutations (line 206) | def n_permutations(self):
method n_permutations (line 210) | def n_permutations(self, n_permutations):
method length (line 216) | def length(self):
FILE: scikit_mol/fingerprints/morgan.py
class MorganFingerprintTransformer (line 12) | class MorganFingerprintTransformer(FpsGeneratorTransformer):
method __init__ (line 28) | def __init__(
method _generate_fp_generator (line 78) | def _generate_fp_generator(self):
method _transform_mol (line 92) | def _transform_mol(self, mol) -> np.array:
FILE: scikit_mol/fingerprints/rdkitfp.py
class RDKitFingerprintTransformer (line 9) | class RDKitFingerprintTransformer(FpsGeneratorTransformer):
method __init__ (line 22) | def __init__(
method _transform_mol (line 83) | def _transform_mol(self, mol) -> np.array:
method _generate_fp_generator (line 89) | def _generate_fp_generator(self):
FILE: scikit_mol/fingerprints/topologicaltorsion.py
class TopologicalTorsionFingerprintTransformer (line 9) | class TopologicalTorsionFingerprintTransformer(FpsGeneratorTransformer):
method __init__ (line 16) | def __init__(
method _generate_fp_generator (line 69) | def _generate_fp_generator(self):
method _transform_mol (line 79) | def _transform_mol(self, mol) -> np.array:
FILE: scikit_mol/parallel.py
function parallelized_with_batches (line 9) | def parallelized_with_batches(
FILE: scikit_mol/plotting.py
class ParallelTester (line 17) | class ParallelTester:
method __init__ (line 22) | def __init__(
method _test_single (line 60) | def _test_single(self, mols, n_jobs):
method test (line 66) | def test(self) -> pd.DataFrame:
function get_processor_name (line 81) | def get_processor_name() -> str:
function plot_heatmap (line 113) | def plot_heatmap(
FILE: scikit_mol/safeinference.py
class MaskedArrayError (line 21) | class MaskedArrayError(ValueError):
function filter_invalid_rows (line 27) | def filter_invalid_rows(warn_on_invalid=False, replace_value=np.nan):
class SafeInferenceWrapper (line 118) | class SafeInferenceWrapper(TransformerMixin, BaseEstimator):
method __init__ (line 130) | def __init__(
method n_features_in_ (line 155) | def n_features_in_(self):
method fit (line 159) | def fit(self, X, y=None, **fit_params):
method predict (line 164) | def predict(self, X, y=None):
method predict_proba (line 169) | def predict_proba(self, X, y=None):
method decision_function (line 174) | def decision_function(self, X, y=None):
method transform (line 179) | def transform(self, X, y=None):
method fit_transform (line 184) | def fit_transform(self, X, y=None, **fit_params):
method score (line 189) | def score(self, X, y=None):
method get_feature_names_out (line 194) | def get_feature_names_out(self, *args, **kwargs):
method __sklearn_is_fitted__ (line 197) | def __sklearn_is_fitted__(self):
FILE: scikit_mol/standardizer.py
class Standardizer (line 24) | class Standardizer(TransformerMixin, NoFitNeededMixin, BaseEstimator):
method __init__ (line 32) | def __init__(
method fit (line 54) | def fit(self, X, y=None):
method _standardize_mol (line 57) | def _standardize_mol(self, mol):
method _transform (line 88) | def _transform(self, X):
method get_feature_names_out (line 92) | def get_feature_names_out(self, input_features=None):
method transform (line 96) | def transform(self, X, y=None):
function parallel_helper (line 104) | def parallel_helper(classname, parameters, X_mols):
FILE: scikit_mol/utilities.py
class CheckSmilesSanitization (line 12) | class CheckSmilesSanitization:
method __init__ (line 13) | def __init__(self, return_mol=False):
method sanitize (line 17) | def sanitize(self, X_smiles_list, y=None):
function set_safe_inference_mode (line 76) | def set_safe_inference_mode(estimator: BaseEstimator, value: bool) -> Ba...
FILE: tests/applicability/conftest.py
function ad_estimator (line 54) | def ad_estimator(request):
function reduced_fingerprints (line 61) | def reduced_fingerprints(mols_list):
function binary_fingerprints (line 71) | def binary_fingerprints(mols_list):
function ad_test_data (line 77) | def ad_test_data():
FILE: tests/applicability/test_base.py
function test_basic_functionality (line 9) | def test_basic_functionality(ad_estimator, reduced_fingerprints):
function test_predict_functionality (line 17) | def test_predict_functionality(ad_estimator, ad_test_data):
function test_score_transform (line 34) | def test_score_transform(ad_estimator, ad_test_data):
function test_threshold_setting (line 54) | def test_threshold_setting(ad_estimator, reduced_fingerprints):
function test_feature_names (line 74) | def test_feature_names(ad_estimator, reduced_fingerprints):
function test_pandas_output (line 84) | def test_pandas_output(ad_estimator, reduced_fingerprints):
function test_input_validation (line 101) | def test_input_validation(ad_estimator):
function test_refit_consistency (line 117) | def test_refit_consistency(ad_estimator, reduced_fingerprints):
FILE: tests/applicability/test_bounding_box.py
function test_bounding_box_bounds (line 12) | def test_bounding_box_bounds(ad_test_data):
function test_bounding_box_violations (line 26) | def test_bounding_box_violations():
function test_bounding_box_percentile_validation (line 46) | def test_bounding_box_percentile_validation():
function test_bounding_box_pipeline (line 61) | def test_bounding_box_pipeline():
FILE: tests/applicability/test_convex_hull.py
function test_convex_hull_simple (line 13) | def test_convex_hull_simple():
function test_convex_hull_pipeline (line 32) | def test_convex_hull_pipeline():
function test_convex_hull_numerical_stability (line 52) | def test_convex_hull_numerical_stability():
function test_convex_hull_single_point (line 71) | def test_convex_hull_single_point():
FILE: tests/applicability/test_hotelling.py
function test_hotelling_threshold (line 12) | def test_hotelling_threshold():
function test_hotelling_scores (line 28) | def test_hotelling_scores():
function test_hotelling_significance_validation (line 50) | def test_hotelling_significance_validation():
function test_hotelling_pipeline (line 62) | def test_hotelling_pipeline():
function test_hotelling_threshold_fitting (line 77) | def test_hotelling_threshold_fitting():
FILE: tests/applicability/test_isolation_forest.py
function test_refit_consistency (line 8) | def test_refit_consistency():
FILE: tests/applicability/test_kernel_density.py
function ad_estimator (line 9) | def ad_estimator():
function test_kernel_parameter (line 14) | def test_kernel_parameter():
function test_bandwidth_effect (line 29) | def test_bandwidth_effect():
FILE: tests/applicability/test_knn.py
function binary_fingerprints (line 11) | def binary_fingerprints(mols_list):
function test_knn_tanimoto (line 16) | def test_knn_tanimoto(binary_fingerprints):
FILE: tests/applicability/test_leverage.py
function test_leverage_statistical_threshold (line 13) | def test_leverage_statistical_threshold(ad_test_data):
function test_leverage_pipeline (line 25) | def test_leverage_pipeline(reduced_fingerprints):
function test_leverage_threshold_factor (line 41) | def test_leverage_threshold_factor():
function test_leverage_var_covar_matrix (line 55) | def test_leverage_var_covar_matrix(ad_test_data):
FILE: tests/applicability/test_local_outlier.py
function ad_estimator (line 10) | def ad_estimator():
function test_n_neighbors_effect (line 15) | def test_n_neighbors_effect():
function test_metric_parameter (line 37) | def test_metric_parameter():
function test_contamination_effect (line 49) | def test_contamination_effect():
FILE: tests/applicability/test_mahalanobis.py
function ad_estimator (line 11) | def ad_estimator():
function test_statistical_threshold (line 16) | def test_statistical_threshold():
function test_mean_covariance (line 35) | def test_mean_covariance():
function test_distance_properties (line 50) | def test_distance_properties():
FILE: tests/applicability/test_standardization.py
function ad_estimator (line 11) | def ad_estimator():
function test_statistical_threshold (line 16) | def test_statistical_threshold():
function test_standardization (line 33) | def test_standardization():
function test_max_absolute_score (line 47) | def test_max_absolute_score():
function test_outlier_detection (line 64) | def test_outlier_detection():
FILE: tests/applicability/test_topkat.py
function ad_estimator (line 11) | def ad_estimator():
function test_ops_transformation (line 16) | def test_ops_transformation():
function test_fixed_threshold (line 32) | def test_fixed_threshold():
function test_eigendecomposition (line 43) | def test_eigendecomposition():
FILE: tests/conftest.py
function pytest_configure (line 14) | def pytest_configure(config):
function md5 (line 25) | def md5(fn):
function data_pth (line 34) | def data_pth(tmp_path_factory) -> Path:
function data (line 54) | def data(data_pth) -> pd.DataFrame:
function pandas_output (line 59) | def pandas_output():
function setup_random (line 68) | def setup_random():
FILE: tests/fixtures.py
function smiles_list (line 49) | def smiles_list():
function smiles_container (line 76) | def smiles_container(
function chiral_smiles_list (line 83) | def chiral_smiles_list(): # Need to be a certain size, so the fingerpri...
function smiles_list_with_invalid (line 95) | def smiles_list_with_invalid(smiles_list):
function invalid_smiles_list (line 102) | def invalid_smiles_list():
function mols_list (line 110) | def mols_list():
function mols_container (line 115) | def mols_container(request):
function chiral_mols_list (line 120) | def chiral_mols_list(chiral_smiles_list):
function mols_with_invalid_container (line 125) | def mols_with_invalid_container(smiles_list_with_invalid):
function fingerprint (line 137) | def fingerprint(mols_list):
function SLC6A4_subset (line 147) | def SLC6A4_subset():
function SLC6A4_subset_with_cddd (line 153) | def SLC6A4_subset_with_cddd(SLC6A4_subset):
function featurizer (line 185) | def featurizer(request):
function combined_transformer (line 190) | def combined_transformer(featurizer):
function morgan_transformer (line 210) | def morgan_transformer():
function rdkit_transformer (line 215) | def rdkit_transformer():
function atompair_transformer (line 220) | def atompair_transformer():
function topologicaltorsion_transformer (line 225) | def topologicaltorsion_transformer():
FILE: tests/test_desctransformer.py
function default_descriptor_transformer (line 30) | def default_descriptor_transformer():
function selected_descriptor_transformer (line 35) | def selected_descriptor_transformer():
function test_descriptor_transformer_clonability (line 41) | def test_descriptor_transformer_clonability(default_descriptor_transform...
function test_descriptor_transformer_set_params (line 52) | def test_descriptor_transformer_set_params(default_descriptor_transformer):
function test_descriptor_transformer_available_descriptors (line 65) | def test_descriptor_transformer_available_descriptors(
function test_descriptor_transformer_transform (line 82) | def test_descriptor_transformer_transform(
function test_descriptor_transformer_wrong_descriptors (line 90) | def test_descriptor_transformer_wrong_descriptors():
function test_descriptor_transformer_parallel (line 103) | def test_descriptor_transformer_parallel(mols_list, default_descriptor_t...
function test_transform_with_safe_inference_mode (line 134) | def test_transform_with_safe_inference_mode(mols_with_invalid_container):
function test_transform_without_safe_inference_mode (line 148) | def test_transform_without_safe_inference_mode(mols_with_invalid_contain...
function test_transform_parallel_with_safe_inference_mode (line 156) | def test_transform_parallel_with_safe_inference_mode(mols_with_invalid_c...
function test_transform_parallel_without_safe_inference_mode (line 170) | def test_transform_parallel_without_safe_inference_mode(mols_with_invali...
function test_safe_inference_mode_setting (line 178) | def test_safe_inference_mode_setting():
function test_descriptor_transformer_pandas_output (line 191) | def test_descriptor_transformer_pandas_output(
function test_descriptor_transformer_pandas_output_pipeline (line 208) | def test_descriptor_transformer_pandas_output_pipeline(
FILE: tests/test_fptransformers.py
function maccs_transformer (line 31) | def maccs_transformer():
function secfp_transformer (line 36) | def secfp_transformer():
function mhfp_transformer (line 41) | def mhfp_transformer():
function avalon_transformer (line 46) | def avalon_transformer():
function test_clonability (line 50) | def test_clonability(
function test_set_params (line 71) | def test_set_params(
function test_transform (line 94) | def test_transform(
function test_transform_parallel (line 120) | def test_transform_parallel(
function test_picklable (line 146) | def test_picklable(
function assert_transformer_set_params (line 164) | def assert_transformer_set_params(tr_class, new_params, mols_list):
function test_SECFingerprintTransformer (line 197) | def test_SECFingerprintTransformer(chiral_mols_list):
function test_MHFingerprintTransformer (line 213) | def test_MHFingerprintTransformer(chiral_mols_list):
function test_AvalonFingerprintTransformer (line 228) | def test_AvalonFingerprintTransformer(chiral_mols_list):
function test_transform_with_safe_inference_mode (line 240) | def test_transform_with_safe_inference_mode(
function test_transform_without_safe_inference_mode (line 264) | def test_transform_without_safe_inference_mode(
function test_transform_parallel_with_safe_inference_mode (line 285) | def test_transform_parallel_with_safe_inference_mode(
FILE: tests/test_fptransformersgenerator.py
function test_fpstransformer_transform_mol (line 34) | def test_fpstransformer_transform_mol(transformer_class, mols_list):
function test_clonability (line 55) | def test_clonability(transformer_class):
function test_set_params (line 68) | def test_set_params(transformer_class):
function test_transform (line 81) | def test_transform(mols_container, transformer_class):
function test_transform_parallel (line 95) | def test_transform_parallel(mols_container, transformer_class):
function test_picklable (line 109) | def test_picklable(transformer_class):
function assert_transformer_set_params (line 125) | def assert_transformer_set_params(transfomer, new_params, mols_list):
function test_morgan_set_params (line 159) | def test_morgan_set_params(chiral_mols_list):
function test_atompairs_set_params (line 174) | def test_atompairs_set_params(chiral_mols_list):
function test_topologicaltorsion_set_params (line 194) | def test_topologicaltorsion_set_params(chiral_mols_list):
function test_RDKitFPTransformer (line 210) | def test_RDKitFPTransformer(chiral_mols_list):
FILE: tests/test_parameter_types.py
function test_Transformer_exotic_types (line 18) | def test_Transformer_exotic_types(
function test_RDKFp_exotic_types (line 49) | def test_RDKFp_exotic_types(mols_list, rdkit_transformer):
FILE: tests/test_safeinferencemode.py
function equal_val (line 20) | def equal_val(value, expected_value):
function transformer (line 31) | def transformer(request):
function smiles_pipeline (line 36) | def smiles_pipeline(request, transformer):
function smiles_pipeline_trained (line 53) | def smiles_pipeline_trained(smiles_pipeline, SLC6A4_subset):
function test_safeinference_wrapper_basic (line 61) | def test_safeinference_wrapper_basic(smiles_pipeline, SLC6A4_subset):
function test_safeinference_wrapper_with_single_invalid_smiles (line 80) | def test_safeinference_wrapper_with_single_invalid_smiles(smiles_pipelin...
function test_safeinference_wrapper_with_invalid_smiles (line 89) | def test_safeinference_wrapper_with_invalid_smiles(
function test_safeinference_wrapper_without_safe_mode (line 111) | def test_safeinference_wrapper_without_safe_mode(
function test_safeinference_wrapper_pandas_output (line 132) | def test_safeinference_wrapper_pandas_output(
function test_safeinference_wrapper_get_feature_names_out (line 148) | def test_safeinference_wrapper_get_feature_names_out(smiles_pipeline):
FILE: tests/test_sanitizer.py
function sanitizer (line 12) | def sanitizer():
function return_mol_sanitizer (line 17) | def return_mol_sanitizer():
function test_checksmilessanitation (line 21) | def test_checksmilessanitation(smiles_list, smiles_list_with_invalid, sa...
function test_checksmilessanitation_x_and_y (line 28) | def test_checksmilessanitation_x_and_y(
function test_checksmilessanitation_np (line 42) | def test_checksmilessanitation_np(smiles_list, smiles_list_with_invalid,...
function test_checksmilessanitation_numpy (line 51) | def test_checksmilessanitation_numpy(smiles_list, smiles_list_with_inval...
function test_checksmilessanitation_return_mol (line 60) | def test_checksmilessanitation_return_mol(
FILE: tests/test_scikit_mol.py
function test_load_data (line 2) | def test_load_data(data):
FILE: tests/test_smilestomol.py
function smilestomol_transformer (line 25) | def smilestomol_transformer():
function test_smilestomol (line 29) | def test_smilestomol(smiles_container, smilestomol_transformer):
function test_smilestomol_transform (line 39) | def test_smilestomol_transform(smilestomol_transformer, smiles_container):
function test_smilestomol_fit (line 45) | def test_smilestomol_fit(smilestomol_transformer, smiles_container):
function test_smilestomol_clone (line 50) | def test_smilestomol_clone(smilestomol_transformer):
function test_smilestomol_unsanitzable (line 57) | def test_smilestomol_unsanitzable(smiles_list_with_invalid, smilestomol_...
function test_descriptor_transformer_parallel (line 62) | def test_descriptor_transformer_parallel(smiles_container, smilestomol_t...
function test_smilestomol_inverse_transform (line 79) | def test_smilestomol_inverse_transform(smilestomol_transformer, smiles_c...
function test_smilestomol_inverse_transform_with_invalid (line 86) | def test_smilestomol_inverse_transform_with_invalid(
function test_smilestomol_get_feature_names_out (line 110) | def test_smilestomol_get_feature_names_out(smilestomol_transformer):
function test_smilestomol_safe_inference (line 115) | def test_smilestomol_safe_inference(smiles_list_with_invalid, smilestomo...
function test_smilestomol_safe_inference_pandas_output (line 140) | def test_smilestomol_safe_inference_pandas_output(
function test_pandas_output (line 165) | def test_pandas_output(smiles_container, smilestomol_transformer, pandas...
FILE: tests/test_transformers.py
function test_transformer (line 42) | def test_transformer(SLC6A4_subset):
function test_transformer_pandas_output (line 114) | def test_transformer_pandas_output(SLC6A4_subset, pandas_output):
function test_pandas_out_same_values (line 184) | def test_pandas_out_same_values(featurizer, mols_container):
function test_combined_transformer_pandas_out (line 209) | def test_combined_transformer_pandas_out(
Condensed preview: 128 files, each showing path, character count, and a content snippet (full structured content: 5,840K chars).
[
{
"path": ".github/CODEOWNERS",
"chars": 79,
"preview": "* @EBjerrum\nscikit_mol/parallel.py @asiomchen\nscikit_mol/plotting.py @asiomchen"
},
{
"path": ".github/workflows/code_quality.yaml",
"chars": 496,
"preview": "name: Code Quality Checks\non: [ push, pull_request ]\njobs:\n ruff-checks:\n runs-on: ubuntu-latest\n steps:\n - "
},
{
"path": ".github/workflows/publish.yaml",
"chars": 2514,
"preview": "name: Publish Python🐍 distribution📦 with uv🌈\n# after releasing a new version, build the distribution and uploads signed "
},
{
"path": ".github/workflows/pytest.yaml",
"chars": 2495,
"preview": "name: scikit_mol ci\n\non:\n push:\n branches: [main]\n tags: ['v*']\n pull_request:\n branches: [main]\n\n# cancel pr"
},
{
"path": ".github/workflows/welcome.yaml",
"chars": 758,
"preview": "name: Welcome WorkFlow\n\non:\n issues:\n types: [opened]\n pull_request_target:\n types: [opened]\n\njobs:\n build:\n "
},
{
"path": ".gitignore",
"chars": 2009,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": ".pre-commit-config.yaml",
"chars": 590,
"preview": "repos:\n- repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v4.5.0\n hooks:\n - id: requirements-txt-"
},
{
"path": ".readthedocs.yaml",
"chars": 713,
"preview": "# Read the Docs configuration file for MkDocs projects\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html f"
},
{
"path": ".vscode/extensions.json",
"chars": 61,
"preview": "{\n \"recommendations\": [\n \"njpwerner.autodocstring\"\n ]\n}\n"
},
{
"path": ".vscode/settings.json",
"chars": 47,
"preview": "{\n \"autoDocstring.docstringFormat\": \"numpy\"\n}\n"
},
{
"path": "CITATION.bib",
"chars": 1538,
"preview": "@article{bjerrum_scikit-mol_2023,\n\ttitle = {Scikit-{Mol} brings cheminformatics to {Scikit}-{Learn}},\n\tauthor = {Bjerrum"
},
{
"path": "CONTRIBUTING.md",
"chars": 218,
"preview": "# Contribution\n\nFor up-to-date information, see\n\n[docs/contribution.md](docs/contributing.md)\n\nor\n\n[https://scikit-mol.r"
},
{
"path": "LICENSE",
"chars": 7487,
"preview": " GNU LESSER GENERAL PUBLIC LICENSE\n Version 3, 29 June 2007\n\nCopyright (C) 2007 "
},
{
"path": "MANIFEST.in",
"chars": 29,
"preview": "prune .github\nexclude .git*\n\n"
},
{
"path": "Makefile",
"chars": 239,
"preview": "sync-notebooks:\n\tuv run jupytext --set-formats docs//notebooks//ipynb,docs//notebooks//scripts//py:percent --sync docs/n"
},
{
"path": "README.md",
"chars": 9581,
"preview": "# scikit-mol\n\n {\n .md-main__inner {\n max-width: none;"
},
{
"path": "docs/assets/js/readthedocs.js",
"chars": 1059,
"preview": "// Add server-side search\ndocument.addEventListener(\"DOMContentLoaded\", function(event) {\n // Trigger Read the Docs' "
},
{
"path": "docs/contributing.md",
"chars": 7730,
"preview": "# Contribution\n\nThanks for your interest in contributing to the project. Please read on in the sections that apply.\n\n## "
},
{
"path": "docs/index.md",
"chars": 9581,
"preview": "# scikit-mol\n\n 2023 Olivier J. M. Béquignon\n\nPermission is hereby granted, free of charge, to any person obtaining a copy"
},
{
"path": "scikit_mol/applicability/README.md",
"chars": 884,
"preview": "# Applicability Domain Estimators\n\nThis module contains applicability domain estimators for chemical modeling.\n\n## Licen"
},
{
"path": "scikit_mol/applicability/__init__.py",
"chars": 1119,
"preview": "from .base import BaseApplicabilityDomain\nfrom .bounding_box import BoundingBoxApplicabilityDomain\nfrom .convex_hull imp"
},
{
"path": "scikit_mol/applicability/base.py",
"chars": 9402,
"preview": "\"\"\"Base class for applicability domain estimators.\"\"\"\n\nfrom abc import ABC, abstractmethod\nfrom typing import Any, Class"
},
{
"path": "scikit_mol/applicability/bounding_box.py",
"chars": 4741,
"preview": "\"\"\"\nBounding box applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLChemAD)\n"
},
{
"path": "scikit_mol/applicability/convex_hull.py",
"chars": 3924,
"preview": "\"\"\"\nConvex hull applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLChemAD)\nO"
},
{
"path": "scikit_mol/applicability/hotelling.py",
"chars": 4348,
"preview": "\"\"\"\nHotelling T² applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLChemAD)\n"
},
{
"path": "scikit_mol/applicability/isolation_forest.py",
"chars": 5878,
"preview": "\"\"\"\nIsolation Forest applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLChem"
},
{
"path": "scikit_mol/applicability/kernel_density.py",
"chars": 3865,
"preview": "\"\"\"\nKernel Density applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLChemAD"
},
{
"path": "scikit_mol/applicability/knn.py",
"chars": 6364,
"preview": "\"\"\"\nK-Nearest Neighbors applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLC"
},
{
"path": "scikit_mol/applicability/leverage.py",
"chars": 4615,
"preview": "\"\"\"\nLeverage-based applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/MLChemAD"
},
{
"path": "scikit_mol/applicability/local_outlier.py",
"chars": 4727,
"preview": "\"\"\"\nLocal Outlier Factor applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBeq/ML"
},
{
"path": "scikit_mol/applicability/mahalanobis.py",
"chars": 4824,
"preview": "\"\"\"\nMahalanobis distance applicability domain.\n\"\"\"\n\nfrom typing import Any, Optional\n\nimport numpy as np\nfrom numpy.typi"
},
{
"path": "scikit_mol/applicability/standardization.py",
"chars": 3759,
"preview": "\"\"\"\nStandardization approach applicability domain.\n\nThis module was adapted from [MLChemAD](https://github.com/OlivierBe"
},
{
"path": "scikit_mol/applicability/topkat.py",
"chars": 5520,
"preview": "\"\"\"\nTOPKAT's Optimal Prediction Space (OPS) applicability domain.\n\nThis module was adapted from [MLChemAD](https://githu"
},
{
"path": "scikit_mol/conversions.py",
"chars": 4836,
"preview": "from collections.abc import Sequence\nfrom typing import Optional, Union\n\nimport numpy as np\nfrom numpy.typing import NDA"
},
{
"path": "scikit_mol/core.py",
"chars": 2954,
"preview": "\"\"\"\nCore functionality for scikit-mol.\n\nUsers of scikit-mol should not need to use this module directly.\nUsers who want "
},
{
"path": "scikit_mol/descriptors.py",
"chars": 6341,
"preview": "import functools\nfrom typing import List, Optional, Union\n\nimport numpy as np\nfrom rdkit.Chem import Descriptors\nfrom rd"
},
{
"path": "scikit_mol/fingerprints/__init__.py",
"chars": 1307,
"preview": "from scikit_mol._constants import DOCS_BASE_URL\n\nfrom .atompair import AtomPairFingerprintTransformer\nfrom .avalon impor"
},
{
"path": "scikit_mol/fingerprints/atompair.py",
"chars": 4126,
"preview": "from typing import Optional, Sequence\n\nimport numpy as np\nfrom rdkit.Chem.rdFingerprintGenerator import GetAtomPairGener"
},
{
"path": "scikit_mol/fingerprints/avalon.py",
"chars": 2427,
"preview": "from typing import Optional\n\nimport numpy as np\nfrom rdkit.Avalon import pyAvalonTools\n\nfrom .baseclasses import FpsTran"
},
{
"path": "scikit_mol/fingerprints/baseclasses.py",
"chars": 11044,
"preview": "import functools\nimport inspect\nimport re\nfrom abc import ABC, abstractmethod\nfrom typing import Optional, Type\nfrom war"
},
{
"path": "scikit_mol/fingerprints/maccs.py",
"chars": 1883,
"preview": "from typing import Optional\n\nimport numpy as np\nfrom rdkit.Chem import rdMolDescriptors\n\nfrom .baseclasses import FpsTra"
},
{
"path": "scikit_mol/fingerprints/minhash.py",
"chars": 7778,
"preview": "from typing import Optional\nfrom warnings import warn\n\nimport numpy as np\nfrom rdkit.Chem import rdMHFPFingerprint\n\nfrom"
},
{
"path": "scikit_mol/fingerprints/morgan.py",
"chars": 3602,
"preview": "from typing import Optional\n\nimport numpy as np\nfrom rdkit.Chem.rdFingerprintGenerator import (\n GetMorganFeatureAtom"
},
{
"path": "scikit_mol/fingerprints/rdkitfp.py",
"chars": 3884,
"preview": "from typing import Optional\n\nimport numpy as np\nfrom rdkit.Chem.rdFingerprintGenerator import GetRDKitFPGenerator\n\nfrom "
},
{
"path": "scikit_mol/fingerprints/topologicaltorsion.py",
"chars": 3861,
"preview": "from typing import Optional, Sequence\n\nimport numpy as np\nfrom rdkit.Chem.rdFingerprintGenerator import GetTopologicalTo"
},
{
"path": "scikit_mol/parallel.py",
"chars": 1570,
"preview": "from collections.abc import Sequence\nfrom typing import Any, Callable, Optional, Union\n\nimport numpy as np\nimport pandas"
},
{
"path": "scikit_mol/plotting.py",
"chars": 5408,
"preview": "import os\nimport platform\nimport re\nimport subprocess\nimport time\nfrom typing import TYPE_CHECKING, Optional, Sequence\n\n"
},
{
"path": "scikit_mol/safeinference.py",
"chars": 7795,
"preview": "\"\"\"Wrapper for sklearn estimators and pipelines to handle errors.\"\"\"\n\nimport warnings\nfrom functools import wraps\nfrom t"
},
{
"path": "scikit_mol/standardizer.py",
"chars": 4069,
"preview": "# A scikit-learn compatible molecule standardizer\n# Author: Son Ha\n\n\nimport functools\nfrom typing import Optional\n\nimpor"
},
{
"path": "scikit_mol/utilities.py",
"chars": 4529,
"preview": "# For a non-scikit-learn check smiles sanitizer class\n\nimport warnings\n\nimport pandas as pd\nfrom rdkit import Chem\nfrom "
},
{
"path": "setup.cfg",
"chars": 1219,
"preview": "[metadata]\nname = scikit_mol\nurl = https://github.com/EBjerrum/scikit-mol\ndownload_url = https://github.com/EBjerrum/sci"
},
{
"path": "tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/applicability/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/applicability/conftest.py",
"chars": 3016,
"preview": "import numpy as np\nimport pytest\nfrom sklearn.decomposition import PCA\nfrom sklearn.preprocessing import StandardScaler\n"
},
{
"path": "tests/applicability/test_base.py",
"chars": 4458,
"preview": "\"\"\"Common tests for all applicability domain estimators.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom numpy.testing import "
},
{
"path": "tests/applicability/test_bounding_box.py",
"chars": 2105,
"preview": "\"\"\"Tests specific to Bounding Box applicability domain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom sklearn.exceptions imp"
},
{
"path": "tests/applicability/test_convex_hull.py",
"chars": 2243,
"preview": "\"\"\"Tests specific to Convex Hull applicability domain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom sklearn.decomposition i"
},
{
"path": "tests/applicability/test_hotelling.py",
"chars": 2518,
"preview": "\"\"\"Tests specific to Hotelling T² applicability domain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom sklearn.exceptions imp"
},
{
"path": "tests/applicability/test_isolation_forest.py",
"chars": 610,
"preview": "\"\"\"Tests specific to Isolation Forest applicability domain.\"\"\"\n\nimport numpy as np\n\nfrom scikit_mol.applicability import"
},
{
"path": "tests/applicability/test_kernel_density.py",
"chars": 1422,
"preview": "\"\"\"Tests for KernelDensityApplicabilityDomain.\"\"\"\n\nimport pytest\n\nfrom scikit_mol.applicability import KernelDensityAppl"
},
{
"path": "tests/applicability/test_knn.py",
"chars": 844,
"preview": "\"\"\"Tests specific to KNN applicability domain.\"\"\"\n\nimport numpy as np\nimport pytest\n\nfrom scikit_mol.applicability impor"
},
{
"path": "tests/applicability/test_leverage.py",
"chars": 2003,
"preview": "\"\"\"Tests specific to Leverage applicability domain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom sklearn.decomposition impo"
},
{
"path": "tests/applicability/test_local_outlier.py",
"chars": 1962,
"preview": "\"\"\"Tests for LocalOutlierFactorApplicabilityDomain.\"\"\"\n\nimport numpy as np\nimport pytest\n\nfrom scikit_mol.applicability "
},
{
"path": "tests/applicability/test_mahalanobis.py",
"chars": 2017,
"preview": "\"\"\"Tests for MahalanobisApplicabilityDomain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom numpy.testing import assert_array"
},
{
"path": "tests/applicability/test_standardization.py",
"chars": 2415,
"preview": "\"\"\"Tests for StandardizationApplicabilityDomain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom numpy.testing import assert_a"
},
{
"path": "tests/applicability/test_topkat.py",
"chars": 1645,
"preview": "\"\"\"Tests for TopkatApplicabilityDomain.\"\"\"\n\nimport numpy as np\nimport pytest\nfrom numpy.testing import assert_array_almo"
},
{
"path": "tests/conftest.py",
"chars": 1892,
"preview": "import hashlib\nimport shutil\nfrom pathlib import Path, PurePath\nfrom urllib.parse import urlsplit\nfrom urllib.request im"
},
{
"path": "tests/fixtures.py",
"chars": 5874,
"preview": "import os\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nimport sklearn\nfrom packaging.v"
},
{
"path": "tests/test_desctransformer.py",
"chars": 7977,
"preview": "import time\n\nimport joblib\nimport numpy as np\nimport numpy.ma as ma\nimport pandas as pd\nimport pytest\nimport sklearn\nfro"
},
{
"path": "tests/test_fptransformers.py",
"chars": 8482,
"preview": "import pickle\nimport tempfile\n\nimport numpy as np\nimport pandas as pd\nimport pytest\nfrom rdkit import Chem\nfrom sklearn "
},
{
"path": "tests/test_fptransformersgenerator.py",
"chars": 7308,
"preview": "import pickle\nimport tempfile\n\nimport numpy as np\nimport pytest\nfrom sklearn import clone\n\nfrom scikit_mol.fingerprints "
},
{
"path": "tests/test_parameter_types.py",
"chars": 1921,
"preview": "import numpy as np\nimport pytest\nfrom rdkit import Chem\n\nfrom .fixtures import (\n atompair_transformer,\n mols_list"
},
{
"path": "tests/test_safeinferencemode.py",
"chars": 4902,
"preview": "import numpy as np\nimport pandas as pd\nimport pytest\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.pip"
},
{
"path": "tests/test_sanitizer.py",
"chars": 2771,
"preview": "import numpy as np\nimport pandas as pd\nimport pytest\nfrom rdkit import Chem\n\nfrom scikit_mol.utilities import CheckSmile"
},
{
"path": "tests/test_scikit_mol.py",
"chars": 53,
"preview": "\ndef test_load_data(data):\n assert len(data) > 0\n\n"
},
{
"path": "tests/test_smilestomol.py",
"chars": 6052,
"preview": "import numpy as np\nimport pandas as pd\nimport pytest\nimport sklearn\nfrom packaging.version import Version\nfrom rdkit imp"
},
{
"path": "tests/test_transformers.py",
"chars": 9295,
"preview": "# checking that the new transformers can work within a scikitlearn pipeline of the kind\n# Pipeline([(\"s2m\", SmilesToMol("
},
{
"path": "uv.toml",
"chars": 30,
"preview": "required-version = \">=0.5.24\"\n"
}
]
About this extraction
This page contains the full source code of the EBjerrum/scikit-mol GitHub repository, extracted and formatted as plain text. The extraction includes 128 files (3.7 MB), approximately 984.6k tokens, and a symbol index with 377 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract.