Repository: mindsdb/lightwood
Branch: main
Commit: bfa472129cc3
Files: 248
Total size: 5.3 MB

Directory structure:
gitextract__fmgkupe/

├── .deepsource.toml
├── .flake8
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.md
│   │   ├── question.md
│   │   └── suggestion.md
│   ├── PULL_REQUEST_TEMPLATE/
│   │   └── pull_request_template.md
│   └── workflows/
│       ├── add_to_docs_project.yml
│       ├── add_to_roadmap_project.yml
│       ├── benchmark_check.yml
│       ├── cla.yml
│       ├── doc_build.yml
│       └── lightwood.yml
├── .gitignore
├── .nojekyll
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── __init__.py
├── assets/
│   └── contributions-agreement/
│       └── signatures/
│           └── cla.json
├── docssrc/
│   ├── Makefile
│   ├── README.md
│   └── source/
│       ├── _static/
│       │   └── custom.css
│       ├── analysis.rst
│       ├── api/
│       │   ├── dtype.rst
│       │   ├── encode.rst
│       │   ├── high_level.rst
│       │   ├── json_ai.rst
│       │   ├── predictor.rst
│       │   └── types.rst
│       ├── api.rst
│       ├── conf.py
│       ├── data.rst
│       ├── encoder.rst
│       ├── ensemble.rst
│       ├── helpers.rst
│       ├── index.rst
│       ├── lightwood_philosophy.rst
│       ├── mixer.rst
│       ├── tutorials/
│       │   ├── README.md
│       │   ├── custom_cleaner/
│       │   │   └── custom_cleaner.ipynb
│       │   ├── custom_encoder_rulebased/
│       │   │   └── custom_encoder_rulebased.ipynb
│       │   ├── custom_explainer/
│       │   │   └── custom_explainer.ipynb
│       │   ├── custom_mixer/
│       │   │   └── custom_mixer.ipynb
│       │   ├── custom_splitter/
│       │   │   └── custom_splitter.ipynb
│       │   ├── tutorial_data_analysis/
│       │   │   └── tutorial_data_analysis.ipynb
│       │   ├── tutorial_time_series/
│       │   │   └── tutorial_time_series.ipynb
│       │   └── tutorial_update_models/
│       │       └── tutorial_update_models.ipynb
│       └── tutorials.rst
├── lightwood/
│   ├── __about__.py
│   ├── __init__.py
│   ├── analysis/
│   │   ├── __init__.py
│   │   ├── analyze.py
│   │   ├── base.py
│   │   ├── explain.py
│   │   ├── helpers/
│   │   │   ├── __init__.py
│   │   │   ├── acc_stats.py
│   │   │   ├── conf_stats.py
│   │   │   ├── feature_importance.py
│   │   │   ├── pyod.py
│   │   │   └── shap.py
│   │   ├── nc/
│   │   │   ├── LICENSE
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── calibrate.py
│   │   │   ├── icp.py
│   │   │   ├── metrics.py
│   │   │   ├── nc.py
│   │   │   ├── norm.py
│   │   │   └── util.py
│   │   └── nn_conf/
│   │       ├── __init__.py
│   │       ├── temp_scale.py
│   │       └── temp_scale_license
│   ├── api/
│   │   ├── __init__.py
│   │   ├── high_level.py
│   │   ├── json_ai.py
│   │   ├── predictor.py
│   │   └── types.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── encoded_ds.py
│   │   ├── timeseries_analyzer.py
│   │   └── timeseries_transform.py
│   ├── encoder/
│   │   ├── __init__.py
│   │   ├── array/
│   │   │   ├── __init__.py
│   │   │   ├── array.py
│   │   │   ├── ts_cat_array.py
│   │   │   └── ts_num_array.py
│   │   ├── audio/
│   │   │   ├── __init__.py
│   │   │   └── mfcc.py
│   │   ├── base.py
│   │   ├── categorical/
│   │   │   ├── __init__.py
│   │   │   ├── autoencoder.py
│   │   │   ├── binary.py
│   │   │   ├── gym.py
│   │   │   ├── multihot.py
│   │   │   ├── onehot.py
│   │   │   └── simple_label.py
│   │   ├── datetime/
│   │   │   ├── __init__.py
│   │   │   ├── datetime.py
│   │   │   └── datetime_sin_normalizer.py
│   │   ├── helpers.py
│   │   ├── identity/
│   │   │   ├── __init__.py
│   │   │   └── identity.py
│   │   ├── image/
│   │   │   ├── __init__.py
│   │   │   ├── helpers/
│   │   │   │   ├── __init__.py
│   │   │   │   └── img_to_vec.py
│   │   │   └── img_2_vec.py
│   │   ├── numeric/
│   │   │   ├── __init__.py
│   │   │   ├── numeric.py
│   │   │   └── ts_numeric.py
│   │   ├── text/
│   │   │   ├── __init__.py
│   │   │   ├── helpers/
│   │   │   │   ├── __init__.py
│   │   │   │   └── pretrained_helpers.py
│   │   │   ├── pretrained.py
│   │   │   ├── short.py
│   │   │   ├── tfidf.py
│   │   │   └── vocab.py
│   │   └── time_series/
│   │       ├── __init__.py
│   │       ├── helpers/
│   │       │   ├── __init__.py
│   │       │   ├── common.py
│   │       │   ├── rnn_helpers.py
│   │       │   └── transformer_helpers.py
│   │       ├── rnn.py
│   │       └── ts.py
│   ├── ensemble/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── best_of.py
│   │   ├── embed.py
│   │   ├── identity.py
│   │   ├── mean_ensemble.py
│   │   ├── mode_ensemble.py
│   │   ├── stacked_ensemble.py
│   │   ├── ts_stacked_ensemble.py
│   │   └── weighted_mean_ensemble.py
│   ├── helpers/
│   │   ├── __init__.py
│   │   ├── codegen.py
│   │   ├── constants.py
│   │   ├── device.py
│   │   ├── general.py
│   │   ├── io.py
│   │   ├── log.py
│   │   ├── numeric.py
│   │   ├── parallelism.py
│   │   ├── seed.py
│   │   ├── templating.py
│   │   ├── text.py
│   │   ├── torch.py
│   │   └── ts.py
│   └── mixer/
│       ├── __init__.py
│       ├── arima.py
│       ├── base.py
│       ├── ets.py
│       ├── helpers/
│       │   ├── __init__.py
│       │   ├── ar_net.py
│       │   ├── default_net.py
│       │   ├── qclassic_net.py
│       │   ├── ranger.py
│       │   ├── residual_net.py
│       │   ├── transform_corss_entropy_loss.py
│       │   └── ts.py
│       ├── lightgbm.py
│       ├── lightgbm_array.py
│       ├── neural.py
│       ├── neural_ts.py
│       ├── nhits.py
│       ├── prophet.py
│       ├── qclassic.py
│       ├── random_forest.py
│       ├── regression.py
│       ├── sktime.py
│       ├── tabtransformer.py
│       ├── unit.py
│       ├── xgboost.py
│       └── xgboost_array.py
├── pyproject.toml
└── tests/
    ├── __init__.py
    ├── data/
    │   ├── airline_sentiment.csv
    │   ├── arrivals.csv
    │   ├── concrete_strength.csv
    │   ├── hdi.csv
    │   ├── house_sales.csv
    │   ├── ionosphere.csv
    │   ├── tripadvisor_binary_sample.csv
    │   └── wine_reviews_binary_sample.csv
    ├── integration/
    │   ├── __init__.py
    │   ├── advanced/
    │   │   ├── __init__.py
    │   │   ├── test_array.py
    │   │   ├── test_custom_modules.py
    │   │   ├── test_text_input.py
    │   │   └── test_timeseries.py
    │   └── basic/
    │       ├── __init__.py
    │       ├── notes.txt
    │       ├── test_airline.py
    │       ├── test_categorical.py
    │       ├── test_cleaner.py
    │       ├── test_embedding.py
    │       ├── test_ensembles.py
    │       ├── test_jsonai.py
    │       ├── test_model_selection.py
    │       ├── test_qclassic.py
    │       ├── test_regression.py
    │       ├── test_save_and_load.py
    │       └── test_weird_target_dist.py
    ├── unit_tests/
    │   ├── __init__.py
    │   ├── analysis/
    │   │   ├── __init__.py
    │   │   ├── test_nc_norm.py
    │   │   ├── test_pyod.py
    │   │   └── test_shap.py
    │   ├── api/
    │   │   └── README.md
    │   ├── data/
    │   │   ├── __init__.py
    │   │   └── test_transform_ts.py
    │   ├── encoder/
    │   │   ├── __init__.py
    │   │   ├── audio/
    │   │   │   ├── __init__.py
    │   │   │   └── test_mfcc.py
    │   │   ├── categorical/
    │   │   │   ├── __init__.py
    │   │   │   ├── test_autoencoder.py
    │   │   │   ├── test_binary.py
    │   │   │   ├── test_label.py
    │   │   │   ├── test_multihot.py
    │   │   │   └── test_onehot.py
    │   │   ├── date/
    │   │   │   ├── __init__.py
    │   │   │   └── test_datetime.py
    │   │   ├── identity/
    │   │   │   ├── __init__.py
    │   │   │   └── test_identity.py
    │   │   ├── images/
    │   │   │   ├── __init__.py
    │   │   │   └── test_img_2_vec.py
    │   │   ├── numeric/
    │   │   │   ├── __init__.py
    │   │   │   └── test_numeric.py
    │   │   ├── text/
    │   │   │   ├── __init__.py
    │   │   │   ├── neg.txt
    │   │   │   ├── pos.txt
    │   │   │   ├── test_pretrained.py
    │   │   │   ├── test_short.py
    │   │   │   ├── test_tfidf.py
    │   │   │   └── test_vocab.py
    │   │   └── time_series/
    │   │       ├── __init__.py
    │   │       ├── test_timeseries_rnn.py
    │   │       └── test_transformer.py
    │   ├── helpers.py
    │   └── mixer/
    │       ├── __init__.py
    │       ├── test_lgbm.py
    │       ├── test_nhits.py
    │       ├── test_random_forest.py
    │       ├── test_tabtransformer.py
    │       └── test_xgboost.py
    └── utils/
        ├── __init__.py
        ├── data_generation.py
        └── timing.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .deepsource.toml
================================================
version = 1

[[analyzers]]
name = "python"
enabled = true

  [analyzers.meta]
  runtime_version = "3.x.x"


================================================
FILE: .flake8
================================================
[flake8]
max-line-length = 120
ignore = E275,E402,F821,W503,W504,C408,W391,E721
exclude = .git,__pycache__,docs,docssrc


================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
labels: bug
---

## Your Environment
* Python version:
* Operating system:
* Lightwood version:

## Describe your issue


## How can we replicate it?
* What dataset did you use (link to it please)
* What was the code you ran


================================================
FILE: .github/ISSUE_TEMPLATE/question.md
================================================
---
name: Question
about: Ask a question
labels: question
---

================================================
FILE: .github/ISSUE_TEMPLATE/suggestion.md
================================================
---
name: Suggestion
about: Suggest a feature, improvement, doc change, etc.
labels: enhancement
---





================================================
FILE: .github/PULL_REQUEST_TEMPLATE/pull_request_template.md
================================================
# Why is it needed?

# What does it do?


================================================
FILE: .github/workflows/add_to_docs_project.yml
================================================
name: Add issue to docs project

on:
  issues:
    types:
      - opened

jobs:
  add-to-project:
    name: Add issue to docs project
    runs-on: ubuntu-latest
    steps:
      - uses: actions/add-to-project@v0.4.0
        with:
          # You can target a repository in a different organization
          # to the issue
          project-url: https://github.com/orgs/mindsdb/projects/32
          github-token: ${{ secrets.ADD_TO_PROJECT_PAT }}
          labeled: documentation


================================================
FILE: .github/workflows/add_to_roadmap_project.yml
================================================
name: Add issue to roadmap project
on:
  issues:
    types:
      - opened
jobs:
  add-to-project:
    name: Add issue to roadmap project
    runs-on: ubuntu-latest
    steps:
      - uses: actions/add-to-project@v0.4.0
        with:
          project-url: https://github.com/orgs/mindsdb/projects/53
          github-token: ${{ secrets.ADD_TO_PROJECT_PAT }}
          labeled: bug, enhancement
          label-operator: OR

================================================
FILE: .github/workflows/benchmark_check.yml
================================================
name: Benchmark Result Check Lightwood

#on:
#  pull_request:
#    branches:
#      - main

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Assert benchmarks ran and this version is better
        run: |
          wget https://raw.githubusercontent.com/torokmark/assert.sh/main/assert.sh
          source assert.sh
          commit=${{ github.event.pull_request.head.sha }}
          against="best"
          status=$(curl -X GET http://benchmarks.mindsdb.com:9107/compare/$against/$commit?release_only=True)
          echo Got benchmark status $status between $commit and $against
          assert_eq "$status" "Yes" && echo 'Good to go!'




================================================
FILE: .github/workflows/cla.yml
================================================
name: "Lightwood CLA Assistant"
on:
  issue_comment:
    types: [created]
  pull_request_target:
    types: [opened,closed,synchronize]

permissions:
  actions: write
  contents: write
  pull-requests: write
  statuses: write

jobs:
  CLAssistant:
    runs-on: mdb-dev
    steps:
      - name: "CLA Assistant"
        if: (github.event.comment.body == 'recheckcla' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request'
        uses: contributor-assistant/github-action@v2.6.1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          path-to-signatures: 'assets/contributions-agreement/signatures/cla.json'
          path-to-cla-document: 'https://github.com/mindsdb/mindsdb_native/blob/stable/assets/contributions-agreement/individual-contributor.md'
          branch: 'main'
          allowlist: bot*, George3d6, ZoranPandovski, paxcema, torrmal, Stpmax, maximlopin, mindsdbadmin


================================================
FILE: .github/workflows/doc_build.yml
================================================
name: Documentation Build Lightwood

on:
  push:
    branches:
      - main
      - separate_doc_branch
      - jupyter_actions

jobs:
  doc_build:
    runs-on: ubuntu-latest
    permissions:
      contents: write

    steps:
      - name: checkout and set up
        uses: actions/checkout@v2

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: 3.11

      - name: install all dependencies
        run: |
          sudo apt install pandoc
          python -m pip install --upgrade pip
          pip install 'Sphinx==6.2.1' 'sphinx-autoapi==3.0.0' 'sphinx-autodoc-typehints' 'sphinx-code-include' 'sphinx-rtd-theme' 'sphinxcontrib-applehelp' 'sphinxcontrib-devhelp' 'sphinxcontrib-htmlhelp' 'sphinxcontrib-jsmath' 'sphinxcontrib-napoleon' 'sphinxcontrib-qthelp' 'sphinxcontrib-serializinghtml' autoapi nbsphinx myst_parser pandoc jupyter matplotlib imblearn fsspec
          pip install --no-cache-dir -e .
      - name: Install NLTK data
        run: |
          python -m nltk.downloader punkt
          python -m nltk.downloader punkt_tab
          python -m nltk.downloader stopwords
      - name: Re-run notebooks
        run: |
          find . -iname '*.ipynb' -exec jupyter nbconvert --to notebook --inplace --execute {} \;  > out.txt 2>&1
          cat out.txt
          cat out.txt | grep -zvqi exception && echo 'no errors detected' || exit
          cat out.txt | grep -zvqi error && echo 'no errors detected' || exit
      - name: Make the docs
        run: |
          cd docssrc && make github

      - name: Deploy to another branch
        uses: s0/git-publish-subdir-action@develop
        env:
          REPO: self
          BRANCH: gh-pages # The branch name where you want to push the assets
          FOLDER: docs # The directory where your assets are generated
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub will automatically add this - you don't need to bother getting a token
          MESSAGE: "Rebuilt the docs" # The commit message


================================================
FILE: .github/workflows/lightwood.yml
================================================
name: Integration and Unit Tests Lightwood

on:
  push:
  pull_request:
    branches:
      - main
  release:
    types: [published]

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: ["3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install poetry
          python -m pip install setuptools==57.5.0
          python -m pip install pydateinfer==0.3.0
          poetry install -E dev -E image
      - name: Install dependencies OSX
        run: |
          if [ "$RUNNER_OS" == "macOS" ]; then
            brew install libomp;
          fi
        shell: bash
        env:
          CHECK_FOR_UPDATES: False
      - name: Lint with flake8
        run: |
          poetry run python -m flake8 .
      - name: Install NLTK data
        run: |
          poetry run python -m nltk.downloader punkt
          poetry run python -m nltk.downloader punkt_tab
          poetry run python -m nltk.downloader stopwords
      - name: Test with unittest
        run: |
          # Run all the "standard" tests
          poetry run python -m unittest discover tests

  deploy:
    runs-on: ubuntu-latest
    environment: PublishCI
    needs: test
    if: github.event_name == 'release'
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
      - name: Build
        run: poetry build
      - name: Publish
        env:
          POETRY_HTTP_BASIC_PYPI_USERNAME: __token__
          POETRY_HTTP_BASIC_PYPI_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
        run: |
          poetry publish --dry-run
          poetry publish


================================================
FILE: .gitignore
================================================
*.pth
*.vec
*.pkl
*.dill
*.test.*
.cache*
*.jar
mindsdb.egg-info
.pypirc

# Byte-compiled / optimized / DLL files
*__pycache__*
*.py[cod]
*$py.class

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# visual studio code
.DStore
.DS_Store
.idea
.vscode

# virtualenv
.venv
venv/
ENV/

# pyenv
.python-version

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Temporary
dynamic_predictor.py

test.pickle
AI.json
AI2.json

# docs
assert.sh
docssrc/build
docssrc/build/*
docssrc/source/tutorials/*.html
docssrc/source/tutorials/*.pickle
docssrc/source/tutorials/*.py
docs
docs/*
*.zip
.ipynb_checkpoints

================================================
FILE: .nojekyll
================================================


================================================
FILE: CODE_OF_CONDUCT.md
================================================

# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at admin@mindsdb.com. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.


## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 2.0,
available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.


================================================
FILE: CONTRIBUTING.md
================================================

# Contribute to Lightwood

We love to receive contributions from the community and hear your opinions! We want to make contributing to Lightwood as easy as it can be.

Being part of the core Lightwood team is possible for anyone who is motivated and wants to be part of that journey!

Please continue reading this guide if you are interested in helping democratize machine learning.

### Hacktoberfest 2021
If you are participating in this year's Hacktoberfest event, please scroll down to read the relevant guidelines for the event.

## How can you help us?

* Report a bug
* Improve documentation
* Solve an issue
* Propose new features
* Discuss feature implementations
* Submit a bug fix
* Test Lightwood with your own data and let us know how it went!

## Code contributions
In general, we follow the ["fork-and-pull"](https://docs.github.com/en/github/collaborating-with-pull-requests/getting-started/about-collaborative-development-models#fork-and-pull-model) git workflow. Here are the steps:

1. Fork the Lightwood repository
2. Make changes and commit them 
3. Make sure that the CI tests pass. You can run the test suite locally with `flake8 .` to check style and `python -m unittest discover tests` to run the automated tests. This doesn't guarantee it will pass remotely since we run on multiple envs, but should work in most cases.
4. Push your local branch to your fork
5. Submit a pull request from your repo to the `main` branch of `mindsdb/lightwood` so that we can review your changes. Be sure to merge the latest from main before making a pull request!


> Note: You will need to sign a CLA (Contributor License Agreement) for the code since Lightwood is under a GPL license. 
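
The numbered steps above can be sketched as shell commands. This is a minimal, local-only illustration of the branch-and-commit portion: the fork URL and branch name are placeholders, and the clone/push steps against your own fork are left as comments.

```shell
# Local-only sketch of the fork-and-pull workflow (runs in a throwaway dir).
tmp=$(mktemp -d)
cd "$tmp"
# git clone https://github.com/<you>/lightwood.git && cd lightwood   # step 1: clone your fork
git init -q demo
cd demo
git -c user.email=you@example.com -c user.name=You commit -q --allow-empty -m "baseline"
git checkout -q -b my-fix                  # work on a feature branch, not main
echo "change" > fix.txt
git add fix.txt
git -c user.email=you@example.com -c user.name=You commit -q -m "Describe the change"
git log --oneline -n 1                     # newest commit sits on my-fix
# git push origin my-fix                   # step 4: push to your fork, then open a PR
#                                          #         against main of mindsdb/lightwood
```

Before pushing, run the same checks CI will run: `flake8 .` for style and `python -m unittest discover tests` for the test suite.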

## Feature and Bug reports
We use GitHub issues to track bugs and features. Report them by opening a [new issue](https://github.com/mindsdb/lightwood/issues/new/choose) and filling out all of the required fields.

## Code review process
Pull request reviews are done on a regular basis. 

If your change has a chance of affecting performance, we will run our private benchmark suite to validate it.

Please, make sure you respond to our feedback/questions.

# Community
If you have additional questions or you want to chat with MindsDB core team, you can join our community: <a href="https://join.slack.com/t/mindsdbcommunity/shared_invite/zt-o8mrmx3l-5ai~5H66s6wlxFfBMVI6wQ" target="_blank"><img src="https://img.shields.io/badge/slack-@mindsdbcommunity-blueviolet.svg?logo=slack " alt="MindsDB Community"></a>.

To get updates on Lightwood and MindsDB’s latest announcements, releases, and events, sign up for our [Monthly Community Newsletter](https://mindsdb.com/newsletter/?utm_medium=community&utm_source=github&utm_campaign=lightwood%20repo).

Join our mission of democratizing machine learning and allowing developers to become data scientists!

# Hacktoberfest 2021

We are very excited that Lightwood is participating in this year's Hacktoberfest 2021 event. This month-long event through October gives you the chance to contribute to the Open Source codebase of Lightwood and MindsDB!

The Lightwood core team has prepared several issues of different types that are ideal for first-time contributors and will be posted throughout the month. It's entirely up to you what you choose to work on and if you have your own great idea, feel free to suggest it by reaching out to us via our Slack community or by posting an issue with the `discussion` tag.

**Our Major Incentive and SWAG!** 

Make contributions and enter into the draw for a [Deep Learning Laptop](https://lambdalabs.com/deep-learning/laptops/tensorbook) **powered by the NVIDIA RTX 3080 Max-Q GPU**. Pre-installed with TensorFlow, PyTorch, CUDA, cuDNN and more.

![Deep Learning Laptop](/assets/laptop.jpeg)

Also we’d love to send you a special MindsDB SWAG gift pack:

![MindsDB Swag](/assets/swag.png)


#### How to participate

1. Contribute by making pull requests to any of our open issues labeled with the `hacktoberfest` tag during October. All hacktoberfest issues will specify how many points a successfully merged PR is worth.
2. Have a total score of at least 5 points in order to enter the big prize draw.
3. Complete the form with links to all your completed PR’s so we know where to ship the gift pack to!

Entries close at midnight (PST) Sunday, 31 October 2021 with the prize draw winner announced at an online event on Monday, 1st of November.

Please check https://mindsdb.com/hacktoberfest for more details.


**Remember:** if you wish to contribute with something that is not currently flagged as a hacktoberfest issue, make an issue (or make a comment if an issue already exists), and let's talk about it!


## Contributor Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/mindsdb/lightwood/blob/main/CODE_OF_CONDUCT.md). By participating in this project, you agree to abide by its terms.


================================================
FILE: LICENSE
================================================
GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU General Public License is a free, copyleft license for
software and other kinds of works.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.  We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors.  You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights.  Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received.  You must make sure that they, too, receive
or can get the source code.  And you must show them these terms so they
know their rights.

  Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.

  For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software.  For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.

  Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so.  This is fundamentally incompatible with the aim of
protecting users' freedom to change the software.  The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable.  Therefore, we
have designed this version of the GPL to prohibit the practice for those
products.  If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.

  Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary.  To prevent this, the GPL assures that
patents cannot be used to render the program non-free.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Use with the GNU Affero General Public License.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    MindsDB, AutoML framework
    Copyright (C) 2019  MindsDB Inc (Jorge Torres)

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

    <MindsDB>  Copyright (C) <2019>  <MindsDB Inc, (Jorge Torres)>
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<https://www.gnu.org/licenses/>.

  The GNU General Public License does not permit incorporating your program
into proprietary programs.  If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.  But first, please read
<https://www.gnu.org/licenses/why-not-lgpl.html>.


================================================
FILE: README.md
================================================
# Lightwood

<!--- badges here? --->

Lightwood is an AutoML framework that enables you to generate and customize machine learning pipelines using a declarative syntax called JSON-AI.

Our goal is to make the data science/machine learning (DS/ML) life cycle easier by allowing users to focus on **what** they want to do with their data without needing to write repetitive boilerplate code around machine learning and data preparation. Instead, we enable you to focus on the parts of a model that are truly unique and custom.

Lightwood works with a variety of data types such as numbers, dates, categories, tags, text, arrays and various multimedia formats. These data types can be combined together to solve complex problems. We also support a time-series mode for problems that have between-row dependencies.

Our JSON-AI syntax allows users to change any and all parts of the models Lightwood automatically generates. The syntax outlines the specific details of each step in the modeling pipeline. Users may override default values (for example, changing the type of a column) or entirely replace steps with their own methods (ex: use a random forest model for a predictor). Lightwood creates a "JSON-AI" object from this syntax, which can then be used to automatically generate Python code representing your pipeline.

For details on how to generate JSON-AI syntax and how Lightwood works, check out the [Lightwood Philosophy](#Lightwood-Philosophy).

## Lightwood Philosophy

Lightwood abstracts the ML pipeline into 3 core steps:

(1) Pre-processing and data cleaning <br>
(2) Feature engineering <br>
(3) Model building and training <br>

<p align="center">
<img src="/assets/lightwood.png" alt="Lightwood internals" width="800"/>
</p>

#### i) Pre-processing and cleaning
For each column in your dataset, Lightwood will identify the suspected data type (numeric, categorical, etc.) via a brief statistical analysis. From this, it will generate a JSON-AI syntax. 

If the user keeps the default behavior, Lightwood will apply a brief pre-processing routine to clean each column according to its identified data type. From there, it will split the data into train/dev/test splits.

The `cleaner` and `splitter` objects respectively refer to the pre-processing and the data splitting functions.
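Conceptually, these two steps can be sketched as follows. This is a simplified pure-Python stand-in, not Lightwood's actual implementation; the function names and the fixed split ratios are illustrative only:

```python
def clean(values):
    """Toy cleaner for a column inferred as numeric: cast strings to float."""
    cleaned = []
    for v in values:
        try:
            cleaned.append(float(v))
        except (TypeError, ValueError):
            continue  # a real cleaner would impute or flag instead of dropping
    return cleaned


def split(values, pct_train=0.8, pct_dev=0.1):
    """Toy splitter: deterministic train/dev/test split by row position."""
    n = len(values)
    n_train = int(n * pct_train)
    n_dev = int(n * pct_dev)
    return {
        "train": values[:n_train],
        "dev": values[n_train:n_train + n_dev],
        "test": values[n_train + n_dev:],
    }
```

For example, `split(clean(["1", "2", "bad", "4"]))` first coerces the column to floats (dropping the unparseable value), then partitions the remaining rows.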

#### ii) Feature Engineering
Data can be converted into features via "encoders". Encoders represent the rules for transforming pre-processed data into numerical representations that a model can use.

Encoders can be **rule-based** or **learned**. A rule-based encoder transforms data per a specific set of instructions (ex: normalizing numerical data) whereas a learned encoder produces a representation of the data after training (ex: a "\[CLS\]" token in a language model).

Encoders are assigned to each column of data based on the data type; users can override this assignment either at the column-based level or at the data-type based level. Encoders inherit from the `BaseEncoder` class. 
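As an illustration, a rule-based encoder follows roughly this shape. This is a self-contained sketch mirroring the `prepare`/`encode`/`decode` interface shown in the tutorials; real Lightwood encoders inherit from `BaseEncoder` and work with PyTorch tensors rather than plain lists:

```python
class MinMaxEncoder:
    """Toy rule-based encoder: scales numbers into [0, 1] and back."""

    def __init__(self):
        self.is_prepared = False

    def prepare(self, priming_data):
        # Learn the scaling rule from the training column
        self.min_val = min(priming_data)
        self.max_val = max(priming_data)
        self.is_prepared = True

    def encode(self, column_data):
        span = (self.max_val - self.min_val) or 1.0
        return [(x - self.min_val) / span for x in column_data]

    def decode(self, encoded_data):
        span = (self.max_val - self.min_val) or 1.0
        return [x * span + self.min_val for x in encoded_data]
```

After `prepare([0, 5, 10])`, encoding `[5]` yields `[0.5]`, and decoding recovers the original value.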

#### iii) Model Building and Training
We call a predictive model that intakes *encoded* feature data and outputs a prediction for the target of interest a `mixer` model. Users can either use Lightwood's default mixers or create their own approaches inherited from the `BaseMixer` class.

We predominantly use PyTorch-based approaches, but can support other models.
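In spirit, a mixer only needs a fit/predict pair over encoded data. The sketch below is a deliberately trivial stand-in (the class name and plain-list inputs are illustrative); actual mixers subclass `BaseMixer` and consume Lightwood's encoded-data containers:

```python
class MeanMixer:
    """Toy mixer: ignores the features and predicts the training-target mean."""

    def fit(self, encoded_X, y):
        # A real mixer would learn a mapping from encoded_X to y
        self.prediction = sum(y) / len(y)
        return self

    def predict(self, encoded_X):
        return [self.prediction for _ in encoded_X]
```

Baselines like this are useful as a sanity check: any learned mixer should beat a constant predictor on held-out data.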

## Usage

We invite you to check out our [documentation](https://mindsdb.github.io/lightwood/) for specific guidelines and tutorials! Please stay tuned for updates and changes. 

### Quick use cases
Lightwood works with `pandas.DataFrames`. Once a DataFrame is loaded, define a "ProblemDefinition" via a dictionary. The only thing a user needs to specify is the name of the column to predict (via the key `target`).

Create a JSON-AI syntax from the command `json_ai_from_problem`. Lightwood can then use this object to *automatically generate python code filling in the steps of the ML pipeline* via `code_from_json_ai`. 

You can make a `Predictor` object, instantiated with that code via `predictor_from_code`. 

To train a `Predictor` end-to-end, starting with unprocessed data, users can use the `predictor.learn()` command with the data.

```python
import pandas as pd
from lightwood.api.high_level import (
    ProblemDefinition,
    json_ai_from_problem,
    code_from_json_ai,
    predictor_from_code,
)

if __name__ == '__main__':
    # Load a pandas dataset
    df = pd.read_csv(
        "https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/hdi/data.csv"
    )

    # Define the prediction task by naming the target column
    pdef = ProblemDefinition.from_dict(
        {
            "target": "Development Index",  # column you want to predict
        }
    )

    # Generate JSON-AI code to model the problem
    json_ai = json_ai_from_problem(df, problem_definition=pdef)

    # OPTIONAL - see the JSON-AI syntax
    # print(json_ai.to_json())

    # Generate python code
    code = code_from_json_ai(json_ai)

    # OPTIONAL - see generated code
    # print(code)

    # Create a predictor from python code
    predictor = predictor_from_code(code)

    # Train a model end-to-end from raw data to a finalized predictor
    predictor.learn(df)

    # Make the train/test splits and show predictions for a few examples
    test_df = predictor.split(predictor.preprocess(df))["test"]
    preds = predictor.predict(test_df).iloc[:10]
    print(preds)
```

### BYOM: Bring your own models

Lightwood supports user architectures/approaches so long as you follow the abstractions provided within each step. 

Our [tutorials](https://mindsdb.github.io/lightwood/tutorials.html) provide specific use cases for how to introduce customization into your pipeline. Check out "custom cleaner", "custom splitter", "custom explainer", and "custom mixer". Stay tuned for further updates.


## Installation

You can install Lightwood as follows:

```python
pip3 install lightwood
```
>Note: depending on your environment, you might have to use pip instead of pip3 in the above command.

However, we recommend creating a python virtual environment.

#### Setting up a dev environment
- Python version should be in the range >=3.8, < 3.11
- Clone lightwood
- `cd lightwood && pip install -r requirements.txt && pip install -r requirements_image.txt`
- Add it to your python path (e.g. by adding `export PYTHONPATH='/where/you/cloned/lightwood':$PYTHONPATH` as a newline at the end of your `~/.bashrc` file)
- Check that the `unittest`s are passing by going into the directory where you cloned lightwood and running: `python -m unittest discover tests` 

> If `python` defaults to python2.x in your environment, use `python3` and `pip3` instead

Currently, the preferred environment for working with lightwood is Visual Studio Code, a very popular Python IDE. However, any IDE should work. While we don't have guides for other IDEs, please feel free to use the following section as a template for VSCode, or to contribute your own tips and tricks for setting up other IDEs.

#### Setting up a VSCode environment

* Install and enable setting sync using github account (if you use multiple machines)
* Install pylance (for types) and make sure to disable pyright
* Go to `Python > Lint: Enabled` and disable everything *but* flake8
* Set `python.linting.flake8Path` to the full path to flake8 (find it with `which flake8`)
* Set `Python › Formatting: Provider` to autopep8
* Add `--global-config=<path_to>/lightwood/.flake8` and `--experimental` to `Python › Formatting: Autopep8 Args`
* Install live share and live share whiteboard


<!--- CONTRIBUTING.md ---->

## Contribute to Lightwood

We love to receive contributions from the community and hear your opinions! We want to make contributing to Lightwood as easy as it can be.

Being part of the core Lightwood team is possible for anyone who is motivated and wants to be part of that journey!

Please continue reading this guide if you are interested in helping democratize machine learning.

### How can you help us?

* Report a bug
* Improve documentation
* Solve an issue
* Propose new features
* Discuss feature implementations
* Submit a bug fix
* Test Lightwood with your own data and let us know how it went!

### Code contributions
In general, we follow the ["fork-and-pull"](https://docs.github.com/en/github/collaborating-with-pull-requests/getting-started/about-collaborative-development-models#fork-and-pull-model) git workflow. Here are the steps:

1. Fork the Lightwood repository
2. Make changes and commit them 
3. Make sure that the CI tests pass. You can run the test suite locally with `flake8 .` to check style and `python -m unittest discover tests` to run the automated tests. This doesn't guarantee it will pass remotely since we run on multiple envs, but should work in most cases.
4. Push your local branch to your fork
5. Submit a pull request from your repo to the `main` branch of `mindsdb/lightwood` so that we can review your changes. Be sure to merge the latest from main before making a pull request!

> Note: You will need to sign a CLA (Contributor License Agreement) for the code since lightwood is under a GPL license. 

### Feature and Bug reports
We use GitHub issues to track bugs and features. Report them by opening a [new issue](https://github.com/mindsdb/lightwood/issues/new/choose) and filling out all of the required inputs.

### Code review process
Pull request (PR) reviews are done on a regular basis. **If your PR does not address a previous issue, please make an issue first**.

If your change could affect performance, we will run our private benchmark suite to validate it.

Please, make sure you respond to our feedback/questions.

# Community
If you have additional questions or want to chat with the MindsDB core team, you can join our community: <a href="https://join.slack.com/t/mindsdbcommunity/shared_invite/zt-o8mrmx3l-5ai~5H66s6wlxFfBMVI6wQ" target="_blank"><img src="https://img.shields.io/badge/slack-@mindsdbcommunity-blueviolet.svg?logo=slack " alt="MindsDB Community"></a>.

To get updates on Lightwood and MindsDB’s latest announcements, releases, and events, sign up for our [Monthly Community Newsletter](https://mindsdb.com/newsletter/?utm_medium=community&utm_source=github&utm_campaign=lightwood%20repo).

Join our mission of democratizing machine learning and allowing developers to become data scientists!

## Contributor Code of Conduct
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/mindsdb/lightwood/blob/main/CODE_OF_CONDUCT.md). By participating in this project, you agree to abide by its terms.


# Current contributors 

<a href="https://github.com/mindsdb/lightwood/graphs/contributors">
  <img src="https://contributors-img.web.app/image?repo=mindsdb/lightwood" />
</a>

# License ![PyPI - License](https://img.shields.io/pypi/l/lightwood)

* [Lightwood License](https://github.com/mindsdb/lightwood/blob/master/LICENSE)



================================================
FILE: __init__.py
================================================
name = "lightwood"


================================================
FILE: assets/contributions-agreement/signatures/cla.json
================================================
{
   "signedContributors": []
}

================================================
FILE: docssrc/Makefile
================================================
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

github:
	@make -b html
	@touch ../docs
	@rm -r ../docs
	@cp -a build/html/. ../docs
	@rm -r build
	@touch ../docs/.nojekyll
	@echo https://mindsdb.github.io/lightwood > ../docs/CNAME


================================================
FILE: docssrc/README.md
================================================
## Compiling the docs
- Make sure you are in `docssrc`, then follow the instructions under `run` in our [documentation building github actions job](https://github.com/mindsdb/lightwood/blob/main/.github/workflows/doc_build.yml#L21)
- Then go into the newly built docs and start a server to view them: `cd ../docs && python3 -m http.server`
- Should now be available at: 0.0.0.0:8000 | Alternatively, you can just open the `index.html` with a browser and that should work too

## Ref

for how autosummary works: https://stackoverflow.com/questions/2701998/sphinx-autodoc-is-not-automatic-enough

## Manual steps

Currently, notebooks have to be built manually using: `find . -iname '*.ipynb' -exec jupyter nbconvert --to notebook --inplace --execute {} \;`

================================================
FILE: docssrc/source/_static/custom.css
================================================
/* override css for readable.css */

/*  styles/fonts to match http://mdanalysis.org (see public/css) */
/* MindsDB --shamrock: #00b06d; */
/* MindsDB --dark: #2c263f;  */
/* MindsDB --aqua-marine: #4dd9ca;  */
/* MindsDB --wheat: #fedc8c;  */
/* MindsDB --watermelon: #f25c63;  */
/* MindsDB --blueberry: #6751ad;  */
/* MindsDB --white: #ffffff;  */
/* MindsDB --slate-grey: #5d6970;  */

/*
.wy-nav-side {
    font-size: 1.6em;
    font-weight: 500;
}
*/

.wy-nav-side .toctree-l1 {
    font-size: 1.5em;
    font-weight: 500;  
}

.wy-nav-side .toctree-l2 {
    font-size: 0.7em !important;
    font-weight: 400 !important;  
}

.wy-nav-side .toctree-l3 {
    font-size: 0.9em !important;
    font-weight: 400 !important;  
}

/* .rst-content dl.class dt, .rst-content dl.function dt */

.field-list dt.field-odd {
    font-size: 13px !important;
    color:#2c263f !important;
    padding-left: 0rem !important;
}

.field-list dt.field-even {
    font-size: 13px !important;
    color:#2c263f !important;
    padding-left: 0rem !important;
}

.field-list dd.field-odd {
    font-size: 13px !important;
    color:#2c263f !important;
    margin-left: 12px !important;
}

.field-list dd.field-even {
    font-size: 13px !important;
    color:#2c263f !important;
    margin-left: 12px !important;
}


.sig {
    background:rgba(254, 220, 140, 0.3) !important;
    border-top: solid 0px #2c263f !important;
    border-left: 0px !important;
    padding-left: 8px;
    padding-right: 6px;
    padding-top: 6px;
    padding-bottom: 6px;
}

.sig .sig-prename {
    color: #2c263f;
}

.sig .sig-name {
    color: #2c263f;
}

.sig .sig-paren {
    color: rgb(93, 105, 112);
}
.sig .sig-param {
    color: rgb(93, 105, 112);
}

.sig .property {
    color: rgb(93, 105, 112);
}

div.rst-content a {
    color: #00b06d;
    text-decoration: none;
}

div.rst-content a:visited {
    color: #00b06d;
}
 
a:hover {
    color: #00b06d !important;
    text-decoration: underline;
}

/*
body {
    font-family: 'PT Sans', Helvetica, Arial, 'sans-serif';
    font-size: 17px;    
}

div.body {
    color: #000000;
}

div.sphinxsidebar a:hover {
    text-decoration: none !important;
}

div.sphinxsidebar p {
    color: #2c263f;
}

// Home MDAnalysis colour
.wy-side-nav-search > a {
    color: #343131;
}

// Side MDAnalysis version colour
.wy-side-nav-search > div.version {
    color: #2c263f;
}

// Menubar caption colour
div.wy-menu-vertical span.caption-text {
    color: #00b06d;
}

// Mobile layout menubar option
nav.wy-nav-top {
    background: #343131;
}

// Menu search bar outline (default blue)
.wy-side-nav-search input[type="text"] {
    border-color: #2c263f;
}


// -- body styles ---------------------------------------------------------

// Different coloured links for sidebar vs body)


pre, tt, code {
    font-family: Menlo, Monaco, 'Courier New', monospace
}


div.body h1 {
    font-weight: bolder;
}

a.headerlink {
    color: #2c263f;
    font-size: 0.8em;
    padding: 0 4px 0 4px;
    text-decoration: none;
}
 
a.headerlink:hover {
    background-color: #2c263f;
    color: #fff;
}

// ------- admonition boxes ------- 

div.admonition {
    margin: 10px 0px;
    padding: 10px 10px;
}

div.admonition p.admonition-title {
    font-size: 100%;
    font-weight: bolder;
}

// ----- Tables -----

// override table width restrictions
// wrap tables instead of scrolling 

@media screen and (min-width: 767px) {

    .wy-table-responsive table td, .wy-table-responsive table th {
       // !important prevents the common CSS stylesheets from overriding
          this as on RTD they are loaded after this stylesheet
       white-space: normal !important;
    }
 
    .wy-table-responsive {
       overflow: visible !important;
       max-width: 100% !important;
    }
 }

// ----- Field lists ------ 

.section > dl.field-list {
    display: flex;
    flex-wrap: wrap;
    margin: 0;
    padding: 0;
}

dl.field-list > dt::after {
    content: ":";
}

.rst-content dl:not(.docutils) dt {
    background: none;
    color: #000000;
    border-top: none;
}

.section > dl.field-list dt {
    margin: 0;
    padding: 0;
    flex-basis: 20%;
    display: block;
}

.section > dl.field-list > dd {
    flex-basis: 70%;
    margin: 0;
}

.section > dl.field-list > dd p {
    margin: 0;
}

// ----- MDAnalysis coloured elements ------ 

.rst-content .viewcode-link, .rst-content .viewcode-back {
    color: #2c263f;
}

.rst-content .guilabel {
    background: #efefef;
    border: 1px solid #2c263f;
}


.rst-content .seealso p.admonition-title {
    background: #2c263f;
}

.rst-content .seealso {
    background: #e3e3e3;
}

.rst-content  .error p.admonition-title, .rst-content  .warning p.admonition-title {
    background: #F45F4B;
}

.rst-content .error, .rst-content .warning {
    background: #FFEEED;
}



.rst-content .caution, .rst-content .note, .rst-content .important {
    background: #FFEBD0;
}

.rst-content code:not(.xref).literal {
    color: #ca6500;
}

.rst-content .caution p.admonition-title, .rst-content .note p.admonition-title, .rst-content .important p.admonition-title  {
    background: #00b06d;
}

.rst-content dl.class dt, .rst-content dl.function dt {
    color: #ca6500;
    background: #FFEBD0;
    border-top: solid 3px #00b06d;
}

*/


================================================
FILE: docssrc/source/analysis.rst
================================================
:mod:`Analysis`
==========================

Analyse mixer ensembles to extract static insights and train predict-time models for dynamic insights.

.. automodule:: analysis
   :members:

================================================
FILE: docssrc/source/api/dtype.rst
================================================
Data Types (dtypes)
--------------------
Lightwood supports several data types used in standard machine learning pipelines. The ``dtype`` class is used to label columns of information as the right input format. The type inference procedure affects what feature engineering methodology is used on a labeled column.

Currently, the supported way to introduce new data types is to include a custom tag in this file and to import a custom cleaning approach. Users may inherit the basic functionality of the cleaner and include their own flag specific to their data type. For steps on how to do this, please see the tutorials.

.. autoclass:: api.dtype.dtype
   :members:

================================================
FILE: docssrc/source/api/encode.rst
================================================
Encode your data
--------------------

.. automodule:: api.encode
   :members:

================================================
FILE: docssrc/source/api/high_level.rst
================================================
High-Level API
--------------------

.. automodule:: api.high_level
   :members:

================================================
FILE: docssrc/source/api/json_ai.rst
================================================
JSON-AI Config
--------------------

.. automodule:: api.json_ai
   :members:

================================================
FILE: docssrc/source/api/predictor.rst
================================================
Predictor Interface
--------------------
The ``PredictorInterface`` creates the skeletal structure around basic functionality of Lightwood.

.. automodule:: api.predictor
   :members:

================================================
FILE: docssrc/source/api/types.rst
================================================
Lightwood API Types
--------------------
Lightwood consists of several high level abstractions to enable the data science/machine learning (DS/ML) pipeline in a step-by-step procedure.

.. automodule:: api.types
   :members:
   :member-order: bysource

================================================
FILE: docssrc/source/api.rst
================================================
:mod:`API`
==========================

The API module is how Lightwood interfaces with the user.

.. toctree::
   :maxdepth: 1
   :caption: Table of Contents:

   api/high_level
   api/dtype
   api/types
   api/predictor
   api/json_ai
   api/encode

================================================
FILE: docssrc/source/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# 2021.09.07
# Strongly inspired from
# https://github.com/MDAnalysis/mdanalysis/blob/master/package/doc/sphinx/source/conf.py

import sys
import os
import datetime


# ----------------- #
# Project information
# ----------------- #
sys.path.append(os.path.abspath('../../lightwood'))

# ----------------- #
# Project information
# ----------------- #
project = 'lightwood'
copyright = '2021, MindsDB'
authors = "MindsDB"
# author = 'Natasha Seelam (natasha@mindsdb.com)'
now = datetime.datetime.now()
copyright = u'2017-{}, '.format(now.year) + authors

# Version of the package
packageversion = __import__('lightwood').__version__

version = packageversion
release = packageversion

# ----------------- #
# Master document
# ----------------- #
master_doc = "index"

# ----------------- #
# General Config
# ----------------- #

# Enable sphinx extensions
extensions = [
    'sphinx.ext.autodoc',
    'autoapi.extension',
    'sphinx.ext.autosectionlabel',
    'sphinx_autodoc_typehints',
    'myst_parser',
    'sphinx_rtd_theme',
    'sphinx.ext.viewcode',
    'sphinx.ext.napoleon',
    'nbsphinx'
]

# Enable markdown usage
source_suffix = {
    '.rst': 'restructuredtext',
    '.txt': 'markdown',
    '.md': 'markdown',
}

source_parsers = {'.md': 'recommonmark.parser.CommonMarkParser'}

# Templates
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []

# ----------------- #
# Formatting and theme
# ----------------- #

# Default colors
colors = {
    'shamrock': '#00b06d',
    'dark_blue': '#2c263f',
    'aqua': '#4dd9ca',
    'wheat': '#fedc8c',
    'watermelon': '#f25c63',
    'blueberry': '#6751ad',
    'white': '#ffffff',
    'slate': '#5d6970',
}

# HTML details
html_theme = 'sphinx_rtd_theme'


html_theme_options = {
    'canonical_url': '',
    'logo_only': True,
    'display_version': True,
    'prev_next_buttons_location': 'bottom',
    'style_external_links': False,
    'style_nav_header_background': 'white',
    # Toc options
    'collapse_navigation': True,
    'sticky_navigation': True,
    'navigation_depth': 4,
    'includehidden': True,
    'titles_only': False,
}

# Pygments syntax highlight themes
pygments_style = 'sphinx'

# to include decorated objects like __init__
autoclass_content = 'both'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = ['custom.css']


# Brand logo
html_logo = "_static/logos/mindsdblogo.png"

html_sidebars = {
    '**': [
        'about.html',
        'navigation.html',
        'relations.html',
        'searchbox.html',
    ]
}

# ----------------- #
# Autodoc capability
# ----------------- #
autoapi_template_dir = '_autoapi_templates'
autoapi_root = 'docs'
autoapi_generate_api_docs = False

autoapi_dirs = ['../../lightwood']

# autodoc_member_order = 'bysource' # Keep order of the members accordingly


================================================
FILE: docssrc/source/data.rst
================================================
:mod:`Data`
==========================

The focus of these modules is on storing, transforming, cleaning, splitting, merging, getting and removing data.

.. automodule:: data
   :members:
   :show-inheritance:

================================================
FILE: docssrc/source/encoder.rst
================================================
:mod:`Encoders`
==========================

Used for encoding data into PyTorch tensors and decoding it from PyTorch tensors

.. automodule:: encoder
   :members:


================================================
FILE: docssrc/source/ensemble.rst
================================================
:mod:`Ensemble`
==========================

Ensemble mixers together in order to generate predictions

.. automodule:: ensemble
   :members:

================================================
FILE: docssrc/source/helpers.rst
================================================
:mod:`Helpers`
==========================

Various helper functions

.. automodule:: helpers
   :members:

================================================
FILE: docssrc/source/index.rst
================================================
.. -*- coding: utf-8 -*-
.. lightwood_docs documentation master file, created by
   sphinx-quickstart on Tue Sep  7 13:07:48 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root ``toctree`` directive.

****************************************
Lightwood
****************************************

:Release: |release|
:Date: |today|
| 
Lightwood is an AutoML framework that enables you to generate and customize machine learning pipelines using a declarative syntax called JSON-AI.

Our goal is to make the data science/machine learning (DS/ML) life cycle easier by allowing users to focus on **what** they want to do with their data without needing to write repetitive boilerplate code around machine learning and data preparation. Instead, we enable you to focus on the parts of a model that are truly unique and custom.

Lightwood works with a variety of data types such as numbers, dates, categories, tags, text, arrays and various multimedia formats. These data types can be combined together to solve complex problems. We also support a time-series mode for problems that have between-row dependencies.

Our JSON-AI syntax allows users to change any and all parts of the models Lightwood automatically generates. The syntax outlines the specific details of each step in the modeling pipeline. Users may override default values (for example, changing the type of a column) or alternatively, entirely replace steps with their own methods (ex: use a random forest model for a predictor). Lightwood creates a "JSON-AI" object from this syntax which can then be used to automatically generate python code to represent your pipeline.

For details as to how Lightwood works, check out the `Lightwood Philosophy <https://mindsdb.github.io/lightwood/lightwood_philosophy.html>`_ .

Quick Guide
=======================
- :ref:`Installation <Installation>`
- :ref:`Example Use Cases <Example Use Cases>`
- :ref:`Contribute to Lightwood <Contribute to Lightwood>`

Installation
============

You can install Lightwood as follows:

.. code-block:: bash

   pip3 install lightwood

.. note:: depending on your environment, you might have to use pip instead of pip3 in the above command.

However, we recommend creating a python virtual environment.

Setting up a dev environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Clone lightwood
- Run ``cd lightwood && pip install -r requirements.txt``
- Add it to your python path (e.g. by adding ``export PYTHONPATH='/where/you/cloned/lightwood':$PYTHONPATH`` as a newline at the end of your ``~/.bashrc`` file)
- Check that the unit-tests are passing by going into the directory where you cloned lightwood and running: ``python -m unittest discover tests`` 

.. warning:: If ``python`` defaults to python2.x in your environment, use ``python3`` and ``pip3`` instead

Currently, the preferred environment for working with lightwood is Visual Studio Code, a very popular Python IDE. However, any IDE should work. While we don't have guides for other IDEs, please feel free to use the following section as a template for VSCode, or to contribute your own tips and tricks for setting up other IDEs.

Setting up a VSCode environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Install and enable setting sync using github account (if you use multiple machines)
* Install pylance (for types) and make sure to disable pyright
* Go to ``Python > Lint: Enabled`` and disable everything *but* flake8
* Set ``python.linting.flake8Path`` to the full path to flake8 (find it with ``which flake8``)
* Set ``Python › Formatting: Provider`` to autopep8
* Add ``--global-config=<path_to>/lightwood/.flake8`` and ``--experimental`` to ``Python › Formatting: Autopep8 Args``
* Install live share and live share whiteboard


Example Use Cases
=======================

Lightwood works with ``pandas.DataFrames``. Once a DataFrame is loaded, define a "ProblemDefinition" via a dictionary. The only thing a user needs to specify is the name of the column to predict (via the key ``target``).

Create a JSON-AI syntax from the command ``json_ai_from_problem``. Lightwood can then use this object to *automatically generate python code filling in the steps of the ML pipeline* via ``code_from_json_ai``. 

You can make a ``Predictor`` object, instantiated with that code via ``predictor_from_code``. 

To train a ``Predictor`` end-to-end, starting with unprocessed data, users can use the ``predictor.learn()`` command with the data.

.. code-block:: python

   import pandas as pd
   from lightwood.api.high_level import (
       ProblemDefinition,
       json_ai_from_problem,
       code_from_json_ai,
       predictor_from_code,
   )

   # Load a pandas dataset
   df = pd.read_csv(
       "https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/hdi/data.csv"
   )

   # Define the prediction task by naming the target column
   pdef = ProblemDefinition.from_dict(
       {
           "target": "Development Index",  # column you want to predict
       }
   )

   # Generate JSON-AI code to model the problem
   json_ai = json_ai_from_problem(df, problem_definition=pdef)

   # OPTIONAL - see the JSON-AI syntax
   #print(json_ai.to_json())

   # Generate python code
   code = code_from_json_ai(json_ai)

   # OPTIONAL - see generated code
   #print(code)

   # Create a predictor from python code
   predictor = predictor_from_code(code)

   # Train a model end-to-end from raw data to a finalized predictor
   predictor.learn(df)

   # Make the train/test splits and show predictions for a few examples
   test_df = predictor.split(predictor.preprocess(df))["test"]
   preds = predictor.predict(test_df).iloc[:10]
   print(preds)

BYOM: Bring your own models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Lightwood supports user architectures/approaches so long as you follow the abstractions provided within each step. 

Our `tutorials <https://mindsdb.github.io/lightwood/tutorials.html>`_ provide specific use cases for how to introduce customization into your pipeline. Check out "custom cleaner", "custom splitter", "custom explainer", and "custom mixer". Stay tuned for further updates.


Contribute to Lightwood
=======================

We love to receive contributions from the community and hear your opinions! We want to make contributing to Lightwood as easy as it can be.

Being part of the core Lightwood team is possible for anyone who is motivated and wants to be part of that journey!

Please continue reading this guide if you are interested in helping democratize machine learning.

How can you help us?
^^^^^^^^^^^^^^^^^^^^^^^^
* Report a bug
* Improve documentation
* Solve an issue
* Propose new features
* Discuss feature implementations
* Submit a bug fix
* Test Lightwood with your own data and let us know how it went!

Code contributions
^^^^^^^^^^^^^^^^^^^^^^^^
In general, we follow the `fork-and-pull <https://docs.github.com/en/github/collaborating-with-pull-requests/getting-started/about-collaborative-development-models#fork-and-pull-model>`_ git workflow. Here are the steps:

1. Fork the Lightwood repository
2. Make changes and commit them 
3. Make sure that the CI tests pass. You can run the test suite locally with ``flake8 .`` to check style and ``python -m unittest discover tests`` to run the automated tests. This doesn't guarantee it will pass remotely since we run on multiple envs, but should work in most cases.
4. Push your local branch to your fork
5. Submit a pull request from your repo to the ``main`` branch of ``mindsdb/lightwood`` so that we can review your changes. Be sure to merge the latest from main before making a pull request!

.. note:: You will need to sign a CLA (Contributor License Agreement) for the code, since Lightwood is under a GPL license.


Feature and Bug reports
^^^^^^^^^^^^^^^^^^^^^^^^
We use GitHub issues to track bugs and feature requests. Report them by opening a `new issue <https://github.com/mindsdb/lightwood/issues/new/choose>`_ and filling out all of the required inputs.


Code review process
^^^^^^^^^^^^^^^^^^^^^^^^^
Pull request (PR) reviews are done on a regular basis. **If your PR does not address a previous issue, please make an issue first**.

If your change has a chance of affecting performance, we will run our private benchmark suite to validate it.

Please make sure you respond to our feedback and questions.


Community
^^^^^^^^^^^^^^^^^^^^^^^^^
If you have additional questions or you want to chat with the MindsDB core team, you can join our community: 

.. raw:: html

    <embed>
    <a href="https://join.slack.com/t/mindsdbcommunity/shared_invite/zt-o8mrmx3l-5ai~5H66s6wlxFfBMVI6wQ" target="_blank"><img src="https://img.shields.io/badge/slack-@mindsdbcommunity-blueviolet.svg?logo=slack " alt="MindsDB Community"></a>
    </embed>
    
To get updates on Lightwood and MindsDB’s latest announcements, releases, and events, sign up for our `Monthly Community Newsletter <https://mindsdb.com/newsletter/?utm_medium=community&utm_source=github&utm_campaign=lightwood%20repo>`_.

Join our mission of democratizing machine learning and allowing developers to become data scientists!

Contributor Code of Conduct
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please note that this project is released with a `Contributor Code of Conduct <https://github.com/mindsdb/lightwood/blob/main/CODE_OF_CONDUCT.md>`_. By participating in this project, you agree to abide by its terms.


Current contributors 
=======================

.. raw:: html

    <embed>
    <a href="https://github.com/mindsdb/lightwood/graphs/contributors">
      <img src="https://contributors-img.web.app/image?repo=mindsdb/lightwood" />
    </a>
    </embed>


License
=======================
.. raw:: html

    <embed>
    <img src="https://img.shields.io/pypi/l/lightwood" alt="PyPI - License">
    </embed>

| `Lightwood License <https://github.com/mindsdb/lightwood/blob/main/LICENSE>`_





Other Links
=======================
.. toctree::
   :maxdepth: 8

   tutorials
   api
   data
   encoder
   mixer
   ensemble
   analysis
   helpers
   lightwood_philosophy

================================================
FILE: docssrc/source/lightwood_philosophy.rst
================================================
:mod:`Lightwood Philosophy`
================================


Introduction
------------

Lightwood works by generating code for `Predictor` objects out of structured data (e.g. a data frame) and a problem definition; the simplest possible definition is just the column to predict.

The data can be anything. It can contain numbers, dates, categories, text (in any language, though English is currently the primary focus), quantities, arrays, matrices, images, audio, or video. The last three are passed as file system paths or URLs, since storing them as binary data can be cumbersome.

The generated `Predictor` object can be fitted by calling a learn method, or through a lower level step-by-step API. It can then make predictions on similar data (same columns except for the target) by calling a predict method. That's the gist of it.

There's an intermediate representation that gets turned into the final `Python` code, called `JsonAI`. This provides an easy way to edit the `Predictor` being generated from the original data and problem specifications. It also enables prototyping custom code without modifying the library itself, or even having a "development" version of the library installed.

Pipeline
------------

Lightwood abstracts the ML pipeline into 3 core steps:

1. Pre-processing and data cleaning
2. Feature engineering
3. Model building and training

.. image:: _static/logos/lightwood.png
    :align: center
    :alt: Lightwood "under-the-hood"

By default, each of them entails:

i) Pre-processing and cleaning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For each column in your dataset, Lightwood will infer the suspected data type (numeric, categorical, etc.) via a brief statistical analysis. From this, it will generate a JsonAI object. 

Lightwood then performs a brief pre-processing pass to clean each column according to its identified data type (e.g. dates represented as a mix of string formats and timestamp floats are converted to datetime objects). From there, it splits the data into train/dev/test sets.

The `cleaner` and `splitter` objects respectively refer to the pre-processing and the data splitting functions.
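
As a concrete illustration of per-type cleaning, the sketch below normalizes a date column given as a mix of string formats and timestamp floats, as described above. It is a simplified stand-in for illustration only, not the actual cleaner implementation (which lives in the companion ``dataprep_ml`` package); the format list is an assumption.

```python
from datetime import datetime, timezone

def clean_date(value):
    """Normalize a date given as a unix timestamp or one of a few
    string formats into a datetime; malformed values become None."""
    if isinstance(value, (int, float)):  # timestamp float
        return datetime.fromtimestamp(value, tz=timezone.utc)
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):  # illustrative formats only
        try:
            return datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            continue
    return None  # malformed values are standardized to None

print(clean_date("2021-10-07").date())   # 2021-10-07
print(clean_date(1633564800.0).date())   # 2021-10-07 (from a timestamp)
print(clean_date("not a date"))          # None
```

Standardizing malformed values to ``None`` mirrors the behavior described in the custom-cleaner tutorial, where all missing or invalid elements get a single canonical representation for downstream treatment.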

ii) Feature Engineering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Data can be converted into features via "encoders". Encoders represent the rules for transforming pre-processed data into a numerical representation that a model can use. 

Encoders can be **rule-based** or **learned**. A rule-based encoder transforms data per a specific set of instructions (e.g. normalizing numerical data), whereas a learned encoder produces a representation of the data after training (e.g. the "\[CLS\]" token in a language model).

Encoders are assigned to each column of data based on the data type, and depending on the type there can be inter-column dependencies (e.g. time series). Users can override this assignment either at the column-based level or at the datatype-based level. Encoders inherit from the `BaseEncoder` class. 
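
To make the rule-based case concrete, here is a minimal toy encoder. The `prepare`/`encode`/`decode` method names mirror the shape of the `BaseEncoder` interface, but the class itself is a dependency-free sketch (real Lightwood encoders operate on tensors, not plain lists):

```python
class MinMaxNumericEncoder:
    """Toy rule-based encoder: min-max normalization of a numeric column.
    A simplified sketch of the prepare/encode/decode pattern, not an
    actual Lightwood encoder."""

    def prepare(self, priming_data):
        # "Preparing" a rule-based encoder just records column statistics
        self._min = min(priming_data)
        self._span = (max(priming_data) - self._min) or 1.0

    def encode(self, column_data):
        # Map each value into the [0, 1] range seen during preparation
        return [(x - self._min) / self._span for x in column_data]

    def decode(self, encoded_data):
        # Invert the normalization back to the original scale
        return [x * self._span + self._min for x in encoded_data]

encoder = MinMaxNumericEncoder()
encoder.prepare([10.0, 20.0, 30.0])
print(encoder.encode([10.0, 25.0, 30.0]))  # [0.0, 0.75, 1.0]
```

A learned encoder would follow the same interface, except that `prepare` would involve actual training rather than just recording statistics.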

iii) Model Building and Training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We call a predictive model that intakes *encoded* feature data and outputs a prediction for the target of interest a `mixer` model. Users can either use Lightwood's default mixers or create their own approaches inherited from the `BaseMixer` class.

We predominantly use PyTorch-based approaches, but other kinds of models are supported as well.

Multiple mixers can be trained for any given `Predictor`. After mixer training, an ensemble is created (and potentially trained) to decide which mixers to use and how to combine their predictions.

Finally, a "model analysis" step looks at the whole ensemble and extracts some stats about it, in addition to building confidence models that allow us to output a confidence and prediction intervals for each prediction. We also use this step to generate some explanations about model behavior.

Predicting is very similar: data is cleaned and encoded, then mixers make their predictions and they get ensembled. Finally, explainer modules determine things like confidence, prediction bounds, and column importances.
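
The predict-time flow described above can be sketched end-to-end with toy stand-ins (all names and functions below are illustrative only, not Lightwood's actual API):

```python
# Schematic predict-time flow: clean -> encode -> mixers -> ensemble

def clean(raw):
    # step 1: pre-processing (here: coerce strings to floats)
    return [float(x) for x in raw]

def encode(cleaned):
    # step 2: feature engineering (here: a trivial two-feature encoding)
    return [[x, x ** 2] for x in cleaned]

mixers = [
    # step 3: each mixer maps encoded features to a prediction
    lambda feats: feats[0] * 2.0,
    lambda feats: feats[0],
]

def ensemble(preds):
    # combine the mixers' outputs (here: a simple mean)
    return sum(preds) / len(preds)

rows = encode(clean(["1", "2"]))
predictions = [ensemble([m(row) for m in mixers]) for row in rows]
print(predictions)  # [1.5, 3.0]
```

In the real pipeline, the analysis blocks would then attach confidence and prediction bounds to each of these ensembled outputs.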


Strengths and drawbacks
------------------------

The main benefit of Lightwood's architecture is that it is very easy to extend. Full understanding of the pipeline (or, indeed, any understanding of it) is not required to improve a specific component. Users can integrate their custom code with minimal hassle, even if PRs are not accepted, while still pulling everything else from upstream. This works well with the open-source nature of the project.

The second advantage this provides is that it is relatively trivial to parallelize since most tasks are done per-feature. The bits which are done on all the data (mixer training and model analysis) are made up of multiple blocks with similar APIs which can themselves be run in parallel.

Finally, most of Lightwood is built on PyTorch, and PyTorch mixers and encoders are first-class citizens insofar as the data format makes them the easiest to work with. Performance on specialized hardware and continued compatibility are thus taken care of for us, which frees up time to work on other things.

The main drawback, however, is that the pipeline separation doesn't allow phases to exert much influence on each other or run jointly. This means you can't easily have mixer gradients propagate back to train the encoders, nor analysis blocks inspect the model and decide that the data cleaning procedure should change. Granted, there's no hard limit on this, but any such implementation would be rather unwieldy in terms of code complexity.






================================================
FILE: docssrc/source/mixer.rst
================================================
:mod:`Mixers`
==========================

Mixers learn to map encoded representations to predictions; they are the core of Lightwood's AutoML.

.. automodule:: mixer
   :members:


================================================
FILE: docssrc/source/tutorials/README.md
================================================
## How to make a tutorial notebook?

We use some of our tutorial notebooks as unit tests to ensure that our pipeline is up-to-date, and to keep our examples relevant. 

In order to preserve our (and the reader's) sanity, these notebooks need to:
1. Run via the CI tools
2. Be executable locally by a user

To make things easier, the Lightwood team has proposed a general set of rules for tutorials:

1. If you are using an external dataset, please ensure there is a URL that links to it (i.e. load it in a dataframe using `pd.read_csv('{link}')`). Exceptions can be made for custom data types if a download link for the dataset is provided. We try to avoid hosting large datasets via GitHub, but please contact us if you believe one should be in our benchmarking suite.
2. Show any **custom code within the notebook**. If you need to export it to a file (e.g. in order to load it as a Lightwood module), use `%%writefile my_file.py` at the top of the Jupyter code block; this will write the cell's code into a file.
3. Please do not save any extra files within the notebook (`.json` files may be ok); if your tutorial really requires saving extra files, please contact us and we can help.
4. Please edit json-ai within the notebook as opposed to externally (i.e. generate a default, then make changes based on the key you need). You can show the difference between default and custom json-ai via a `print` statement.
5. The notebook must not have any kernel metadata, otherwise GitHub Actions will fail to run it (in the JSON representation, grep for `kernel` and you will find the global `metadata` key; set it to `{}`)
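
Rule 5 can be automated with a few lines of Python. The snippet below (file name and function name are illustrative) clears the global `metadata` key of a notebook's JSON representation:

```python
import json

def strip_notebook_metadata(nb: dict) -> dict:
    """Set the notebook's global metadata (which holds the kernel info)
    to an empty dict, as required for CI to execute the notebook."""
    nb["metadata"] = {}
    return nb

# Typical usage against a file (path is illustrative):
#   with open("my_tutorial.ipynb") as f:
#       nb = json.load(f)
#   with open("my_tutorial.ipynb", "w") as f:
#       json.dump(strip_notebook_metadata(nb), f, indent=1)

nb = {"cells": [], "metadata": {"kernelspec": {"name": "python3"}}, "nbformat": 4}
print(strip_notebook_metadata(nb)["metadata"])  # {}
```

Note that only the global `metadata` key needs clearing; cell-level metadata is left untouched.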


If your tutorial is anything more than a single `.ipynb` notebook and some accompanying `.png` or `.jpg` files, it may be rejected automatically. We would be more than happy to work with you to help adapt it to fit our automated notebook setup. 


================================================
FILE: docssrc/source/tutorials/custom_cleaner/custom_cleaner.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "regulated-manufacturer",
   "metadata": {},
   "source": [
    "## Using your own pre-processing methods in Lightwood\n",
    "\n",
    "#### Date: 2021.10.07\n",
    "\n",
    "For the notebook below, we'll be exploring how to make **custom pre-processing** methods for our data. Lightwood has standard cleaning protocols to handle a variety of different data types, however, we want users to feel comfortable augmenting and addressing their own changes. To do so, we'll highlight the approach we would take below:\n",
    "\n",
    "\n",
    "We will use data from [Kaggle](https://www.kaggle.com/c/commonlitreadabilityprize/data?select=train.csv). \n",
    "\n",
    "The data has several columns, but ultimately aims to use text to predict a *readability score*. There are also some columns that I do not want to use when making predictions, such as `url_legal`, `license`, among others.\n",
    "\n",
    "In this tutorial, we're going to focus on making changes to 2 columns: \n",
    "(1) **excerpt**, a text column, and ensuring we remove stop words using NLTK. <br>\n",
    "(2) **target**, the goal to predict; we will make this explicitly non-negative.\n",
    "\n",
    "Note, for this ACTUAL challenge, negative and positive are meaningful. We are using this as an example dataset to demonstrate how you can make changes to your underlying dataset and proceed to building powerful predictors.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "happy-wheat",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:13.425276Z",
     "iopub.status.busy": "2022-02-03T21:30:13.424404Z",
     "iopub.status.idle": "2022-02-03T21:30:15.210014Z",
     "shell.execute_reply": "2022-02-03T21:30:15.209637Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import nltk\n",
    "\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Lightwood modules\n",
    "import lightwood as lw\n",
    "from lightwood import ProblemDefinition, \\\n",
    "                      JsonAI, \\\n",
    "                      json_ai_from_problem, \\\n",
    "                      code_from_json_ai, \\\n",
    "                      predictor_from_code"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "indie-chaos",
   "metadata": {},
   "source": [
    "### 1) Load your data\n",
    "\n",
    "Lightwood uses `pandas` in order to handle datasets, as this is a very standard package in data science. We can load our dataset using pandas in the following manner (make sure your data is in the data folder!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "recognized-parish",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:15.214940Z",
     "iopub.status.busy": "2022-02-03T21:30:15.214680Z",
     "iopub.status.idle": "2022-02-03T21:30:18.082996Z",
     "shell.execute_reply": "2022-02-03T21:30:18.082726Z"
    }
   },
   "outputs": [],
   "source": [
    "# Load the data\n",
    "data = pd.read_csv(\"https://mindsdb-example-data.s3.eu-west-2.amazonaws.com/jupyter/train.csv.zip\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "official-wright",
   "metadata": {},
   "source": [
    "We see **6 columns**, a variety which are numerical, missing numbers, text, and identifiers or \"ids\". For our predictive task, we are only interested in 2 such columns, the **excerpt** and **target** columns.\n",
    "\n",
    "### 2) Create a JSON-AI default object\n",
    "Before we create a custom cleaner object, let's first create JSON-AI syntax for our problem based on its specifications. We can do so by setting up a ``ProblemDefinition``. The ``ProblemDefinition`` allows us to specify the target, the column we intend to predict, along with other details. \n",
    "\n",
    "The end goal of JSON-AI is to provide *a set of instructions on how to compile a machine learning pipeline*.\n",
    "\n",
    "In this case, let's specify our target, the aptly named **target** column. We will also tell JSON-AI to throw away features we never intend to use, such as \"url_legal\", \"license\", and \"standard_error\". We can do so in the following lines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "chicken-truth",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:18.085631Z",
     "iopub.status.busy": "2022-02-03T21:30:18.085365Z",
     "iopub.status.idle": "2022-02-03T21:30:33.992691Z",
     "shell.execute_reply": "2022-02-03T21:30:33.992410Z"
    }
   },
   "outputs": [],
   "source": [
    "# Setup the problem definition\n",
    "problem_definition = {\n",
    "    'target': 'target',\n",
    "    \"ignore_features\": [\"url_legal\", \"license\", \"standard_error\"]\n",
    "}\n",
    "\n",
    "# Generate the j{ai}son syntax\n",
    "json_ai = json_ai_from_problem(data, problem_definition)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "needed-flashing",
   "metadata": {},
   "source": [
    "Lightwood, as it processes the data, will provide the user a few pieces of information.\n",
    "\n",
    "(1) It drops the features we specify in the `ignore_features` argument <br>\n",
    "(2) It takes a small sample of data from each column to *automatically infer the data type* <br>\n",
    "(3) For each column that was not ignored, it identifies the most likely data type.<br>\n",
    "(4) It notices that \"ID\" is a hash-like-identifier.<br>\n",
    "(5) It conducts a small statistical analysis on the distributions in order to generate syntax.<br>\n",
    "\n",
    "As soon as you request a JSON-AI object, Lightwood automatically creates functional syntax from your data. You can see it as follows: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "designed-condition",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:33.996223Z",
     "iopub.status.busy": "2022-02-03T21:30:33.995947Z",
     "iopub.status.idle": "2022-02-03T21:30:33.997746Z",
     "shell.execute_reply": "2022-02-03T21:30:33.997483Z"
    }
   },
   "outputs": [],
   "source": [
    "print(json_ai.to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "level-vacation",
   "metadata": {},
   "source": [
    "The above shows the minimal syntax required to create a functional JSON-AI object. For each feature you consider in the dataset, we specify the name of the feature, the type of encoder (feature-engineering method) to process the feature, and keyword arguments for the encoder. For the output, we perform a similar operation, but specify the types of mixers, or algorithms used in making a predictor that can estimate the target. Lastly, we populate the \"problem_definition\" key with the ingredients for our ML pipeline.\n",
    "\n",
    "These are the only elements required to get off the ground with JSON-AI. However, we're interested in making a *custom* approach. So, let's make this syntax a file, and introduce our own changes."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "integrated-entrepreneur",
   "metadata": {},
   "source": [
    "### 3) Build your own cleaner module\n",
    "\n",
    "Let's make a file called `MyCustomCleaner.py`. To write this file, we will use `dataprep_ml.cleaners.cleaner` as inspiration. `dataprep_ml` is a companion library that is part of the broader MindsDB ecosystem, and specializes in data cleaning, data splitting and data analysis.\n",
    "\n",
    "The goal of the cleaner is to pre-process your dataset - the output is simply a pandas DataFrame. In theory, any pre-processing can be done here. However, data can be highly irregular - our default `Cleaner` function has several main goals:\n",
    "\n",
    "(1) Strip away any identifier, etc. unwanted columns <br>\n",
    "(2) Apply a cleaning function to each column in the dataset, according to that column's data type <br>\n",
    "(3) Standardize NaN values within each column for appropriate downstream treatment <br>\n",
    "\n",
    "You can choose to omit many of these details and completely write this module from scratch, but the easiest way to introduce your custom changes is to borrow the `Cleaner` function, and add core changes in a custom block.\n",
    "\n",
    "This can be done as follows:\n",
    "\n",
    "\n",
    "You can see individual cleaning functions in `dataprep_ml.cleaners`. If you want to entirely replace a cleaning technique given a particular data-type, we invite you to change `dataprep_ml.cleaners.get_cleaning_func` using the argument `custom_cleaning_functions`; in this dictionary, for a datatype (specified in `type_infer.dtype`), you can assign your own function to override our defaults."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "325d8f1b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:34.001348Z",
     "iopub.status.busy": "2022-02-03T21:30:34.001032Z",
     "iopub.status.idle": "2022-02-03T21:30:34.002545Z",
     "shell.execute_reply": "2022-02-03T21:30:34.002730Z"
    }
   },
   "outputs": [],
   "source": [
    "%%writefile MyCustomCleaner.py\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from type_infer.dtype import dtype\n",
    "\n",
    "from lightwood.helpers import text\n",
    "from lightwood.helpers.log import log\n",
    "from lightwood.api.types import TimeseriesSettings\n",
    "\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "from typing import Dict\n",
    "\n",
    "# Borrow cleaner functions\n",
    "from dataprep_ml.cleaners import (\n",
    "    _remove_columns,\n",
    "    _get_columns_to_clean,\n",
    "    get_cleaning_func,\n",
    ")\n",
    "\n",
    "# Use for standardizing NaNs\n",
    "VALUES_FOR_NAN_AND_NONE_IN_PANDAS = [np.nan, \"nan\", \"NaN\", \"Nan\", \"None\"]\n",
    "\n",
    "\n",
    "def cleaner(\n",
    "    data: pd.DataFrame,\n",
    "    dtype_dict: Dict[str, str],\n",
    "    identifiers: Dict[str, str],\n",
    "    target: str,\n",
    "    mode: str,\n",
    "    timeseries_settings: TimeseriesSettings,\n",
    "    anomaly_detection: bool,\n",
    "    custom_cleaning_functions: Dict[str, str] = {},\n",
    ") -> pd.DataFrame:\n",
    "    \"\"\"\n",
    "    The cleaner is a function which takes in the raw data, plus additional information about its types and about the problem. Based on this, it generates a \"clean\" representation of the data, where each column has an ideal standardized type and all malformed, missing, or otherwise invalid elements are turned into ``None``\n",
    "\n",
    "    :param data: The raw data\n",
    "    :param dtype_dict: Type information for each column\n",
    "    :param identifiers: A dict containing all identifier typed columns\n",
    "    :param target: The target columns\n",
    "    :param mode: Can be \"predict\" or \"train\"\n",
    "    :param timeseries_settings: Timeseries related settings, only relevant for timeseries predictors, otherwise can be the default object\n",
    "    :param anomaly_detection: Are we detecting anomalies with this predictor?\n",
    "\n",
    "    :returns: The cleaned data\n",
    "    \"\"\"  # noqa\n",
    "\n",
    "    data = _remove_columns(\n",
    "        data,\n",
    "        identifiers,\n",
    "        target,\n",
    "        mode,\n",
    "        timeseries_settings,\n",
    "        anomaly_detection,\n",
    "        dtype_dict,\n",
    "    )\n",
    "\n",
    "    for col in _get_columns_to_clean(data, dtype_dict, mode, target):\n",
    "\n",
    "        log.info(\"Cleaning column =\" + str(col))\n",
    "        # Get and apply a cleaning function for each data type\n",
    "        # If you want to customize the cleaner, you'll likely want to modify ``get_cleaning_func``\n",
    "        fn, vec = get_cleaning_func(dtype_dict[col], custom_cleaning_functions)\n",
    "        if not vec:\n",
    "            data[col] = data[col].apply(fn)\n",
    "        if vec:\n",
    "            data[col] = fn(data[col])\n",
    "\n",
    "        # ------------------------ #\n",
    "        # INTRODUCE YOUR CUSTOM BLOCK\n",
    "\n",
    "        # If column data type is a text type, remove stop-words\n",
    "        if dtype_dict[col] in (dtype.rich_text, dtype.short_text):\n",
    "            data[col] = data[col].apply(\n",
    "                lambda x: \" \".join(\n",
    "                    [word for word in x.split() if word not in stop_words]\n",
    "                )\n",
    "            )\n",
    "\n",
    "        # Enforce numerical columns as non-negative\n",
    "        if dtype_dict[col] in (dtype.integer, dtype.float):\n",
    "            log.info(\"Converted \" + str(col) + \" into strictly non-negative\")\n",
    "            data[col] = data[col].apply(lambda x: x if x > 0 else 0.0)\n",
    "\n",
    "        # ------------------------ #\n",
    "        data[col] = data[col].replace(\n",
    "            to_replace=VALUES_FOR_NAN_AND_NONE_IN_PANDAS, value=None\n",
    "        )\n",
    "\n",
    "    return data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "radical-armenia",
   "metadata": {},
   "source": [
    "#### Place your custom module in `~/lightwood_modules` or `/etc/lightwood_modules`\n",
    "\n",
    "We automatically search for custom scripts in your `~/lightwood_modules` and `/etc/lightwood_modules` path. Place your file there. Later, you'll see when we autogenerate code, that you can change your import location if you choose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f030f8ca",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:34.005036Z",
     "iopub.status.busy": "2022-02-03T21:30:34.004771Z",
     "iopub.status.idle": "2022-02-03T21:30:34.006037Z",
     "shell.execute_reply": "2022-02-03T21:30:34.006254Z"
    }
   },
   "outputs": [],
   "source": [
    "from lightwood import load_custom_module\n",
    "\n",
    "# Lightwood automatically does this for us if we want\n",
    "load_custom_module('MyCustomCleaner.py')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "characteristic-promotion",
   "metadata": {},
   "source": [
    "### 4) Introduce your custom cleaner in JSON-AI\n",
    "\n",
    "Now let's introduce our custom cleaner. JSON-AI keeps a lightweight syntax but fills in many default modules (like splitting and cleaning). As you can see, it is also agnostic to the origin of the module, as long as it behaves like the other modules that could be used in any given key.\n",
    "\n",
    "For the custom cleaner, we'll work by editing the \"cleaner\" key. We will change properties within it as follows:\n",
    "(1) \"module\" - place the name of the function. In our case it will be \"MyCustomCleaner.cleaner\"\n",
    "(2) \"args\" - any keyword argument specific to your cleaner's internals. \n",
    "\n",
    "This will look as follows:\n",
    "```\n",
    "    \"cleaner\": {\n",
    "        \"module\": \"MyCustomCleaner.cleaner\",\n",
    "        \"args\": {\n",
    "            \"identifiers\": \"$identifiers\",\n",
    "            \"data\": \"data\",\n",
    "            \"dtype_dict\": \"$dtype_dict\",\n",
    "            \"target\": \"$target\",\n",
    "            \"mode\": \"$mode\",\n",
    "            \"timeseries_settings\": \"$problem_definition.timeseries_settings\",\n",
    "            \"anomaly_detection\": \"$problem_definition.anomaly_detection\"\n",
    "        }\n",
    "```\n",
    "\n",
    "You may be wondering what the \"$\" variables reference. In certain cases, we'd like JSON-AI to auto-fill internal variables when automatically generating code; for example, we've already specified the \"target\", so it is easier to simply refer to that term in a modular way. That is what these variables represent.\n",
    "\n",
    "As we borrowed most of the default `Cleaner`, we keep these arguments. In theory, if we were writing these details from scratch, we could customize these values as necessary."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "respiratory-radiation",
   "metadata": {},
   "source": [
    "### 5) Generate Python code representing your ML pipeline\n",
    "\n",
    "Now we're ready to load up our custom JSON-AI and generate the predictor code!\n",
    "\n",
    "We can do this by first reading in our custom json-syntax, and then calling the function `code_from_json_ai`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "floating-patent",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:34.009559Z",
     "iopub.status.busy": "2022-02-03T21:30:34.009134Z",
     "iopub.status.idle": "2022-02-03T21:30:34.014254Z",
     "shell.execute_reply": "2022-02-03T21:30:34.014480Z"
    }
   },
   "outputs": [],
   "source": [
    "# Make changes to your JSON-AI\n",
    "json_ai.cleaner = {\n",
    "        \"module\": \"MyCustomCleaner.cleaner\",\n",
    "        \"args\": {\n",
    "            \"identifiers\": \"$identifiers\",\n",
    "            \"data\": \"data\",\n",
    "            \"dtype_dict\": \"$dtype_dict\",\n",
    "            \"target\": \"$target\",\n",
    "            \"mode\": \"$mode\",\n",
    "            \"timeseries_settings\": \"$problem_definition.timeseries_settings.to_dict()\",\n",
    "            \"anomaly_detection\": \"$problem_definition.anomaly_detection\"\n",
    "        }\n",
    "}\n",
    "\n",
    "#Generate python code that fills in your pipeline\n",
    "code = code_from_json_ai(json_ai)\n",
    "\n",
    "print(code)\n",
    "\n",
    "# Save code to a file (Optional)\n",
    "with open('custom_cleaner_pipeline.py', 'w') as fp:\n",
    "    fp.write(code)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "handled-oasis",
   "metadata": {},
   "source": [
    "As you can see, an end-to-end pipeline of our entire ML procedure has been generated. There are several abstracted functions to enable transparency as to what processes your data goes through in order to build these models.\n",
    "\n",
    "The key steps of the pipeline are as follows:\n",
    "\n",
    "(1) Run a **statistical analysis** with `analyze_data` <br>\n",
    "(2) Clean your data with `preprocess` <br>\n",
    "(3) Make a training/dev/testing split with `split` <br>\n",
    "(4) Prepare your feature-engineering pipelines with `prepare` <br>\n",
    "(5) Create your features with `featurize` <br>\n",
    "(6) Fit your predictor models with `fit` <br>\n",
    "\n",
    "You can customize this further if necessary, but you have all the steps necessary to train a model!\n",
    "\n",
    "We recommend familiarizing yourself with these steps by calling the above commands, ideally in order. Some commands (namely `prepare`, `featurize`, and `fit`) do depend on other steps.\n",
    "\n",
    "If you want to omit the individual steps, we recommend you simply call the `learn` method, which runs all the necessary steps to give you a fully trained predictor starting from unprocessed data!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "meaning-saskatchewan",
   "metadata": {},
   "source": [
    "### 6) Call python to run your code and see your preprocessed outputs\n",
    "\n",
    "Once we have code, we can turn this into a python object by calling `predictor_from_code`. This instantiates the `PredictorInterface` object. \n",
    "\n",
    "This predictor object can be then used to run your pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "violent-guard",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:34.016713Z",
     "iopub.status.busy": "2022-02-03T21:30:34.016311Z",
     "iopub.status.idle": "2022-02-03T21:30:34.020897Z",
     "shell.execute_reply": "2022-02-03T21:30:34.021123Z"
    }
   },
   "outputs": [],
   "source": [
    "# Turn the code above into a predictor object\n",
    "predictor = predictor_from_code(code)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "closing-episode",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:34.023678Z",
     "iopub.status.busy": "2022-02-03T21:30:34.023275Z",
     "iopub.status.idle": "2022-02-03T21:30:34.114309Z",
     "shell.execute_reply": "2022-02-03T21:30:34.114071Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "predictor.mode = \"train\"\n",
    "\n",
    "# Perform stats analysis\n",
    "predictor.analyze_data(data)\n",
    "\n",
    "# Pre-process the data\n",
    "cleaned_data = predictor.preprocess(data)\n",
    "\n",
    "cleaned_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "major-stake",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:34.118015Z",
     "iopub.status.busy": "2022-02-03T21:30:34.117499Z",
     "iopub.status.idle": "2022-02-03T21:30:34.119943Z",
     "shell.execute_reply": "2022-02-03T21:30:34.119720Z"
    }
   },
   "outputs": [],
   "source": [
    "print(\"\\033[1m\"  + \"Original Data\\n\" + \"\\033[0m\")\n",
    "print(\"Excerpt:\\n\", data.iloc[0][\"excerpt\"])\n",
    "print(\"\\nTarget:\\n\", data.iloc[0][\"target\"])\n",
    "\n",
    "print(\"\\033[1m\"  + \"\\n\\nCleaned Data\\n\" + \"\\033[0m\")\n",
    "print(\"Excerpt:\\n\", cleaned_data.iloc[0][\"excerpt\"])\n",
    "print(\"\\nTarget:\\n\", cleaned_data.iloc[0][\"target\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "celtic-scientist",
   "metadata": {},
   "source": [
    "As you can see, the cleaning-process we introduced cut out the stop-words from the Excerpt, and enforced the target data to stay positive.\n",
    "\n",
    "We hope this tutorial was informative on how to introduce a **custom preprocessing method** to your datasets! For more customization tutorials, please check our [documentation](https://lightwood.io/tutorials.html).\n",
    "\n",
    "If you want to download the Jupyter-notebook version of this tutorial, check out the source github location found here: `lightwood/docssrc/source/tutorials/custom_cleaner`. "
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docssrc/source/tutorials/custom_encoder_rulebased/custom_encoder_rulebased.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "smooth-philip",
   "metadata": {},
   "source": [
    "### Custom Encoder: Rule-Based\n",
    "\n",
    "Lightwood uses \"Encoders\" to convert preprocessed (cleaned) data into **features**. Encoders represent the **feature engineering** step of the data science pipeline; they can either have a set of instructions (\"rule-based\") or a learned representation (trained on data).\n",
    "\n",
    "In the following notebook, we will experiment with creating a custom encoder that creates **Label Encoding**. \n",
    "\n",
    "For example, imagine we have the following set of categories:\n",
    "\n",
    "```\n",
    "MyColumnData = [\"apple\", \"orange\", \"orange\", \"banana\", \"apple\", \"dragonfruit\"]\n",
    "```\n",
    "\n",
    "There are 4 categories to consider: \"apple\", \"banana\", \"orange\", and \"dragonfruit\".\n",
    "\n",
    "**Label encoding** allows you to refer to these categories as if they were numbers. For example, consider the mapping (arranged alphabetically):\n",
    "\n",
    "1 - apple <br>\n",
    "2 - banana <br>\n",
    "3 - dragonfruit <br>\n",
    "4 - orange <br>\n",
    "\n",
    "Using this mapping, we can convert the above data as follows:\n",
    "\n",
    "```\n",
    "MyFeatureData = [1, 4, 4, 2, 1, 3]\n",
    "```\n",
    "\n",
    "In the following notebook, we will design a **LabelEncoder** for Lightwood for use on categorical data. We will be using the Kaggle \"Used Car\" [dataset](https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes). We've provided a link for you to automatically access this CSV. This dataset describes various details of cars on sale - with the goal of predicting how much this car may sell for.\n",
    "\n",
    "Let's get started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "raising-adventure",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:36.493568Z",
     "iopub.status.busy": "2022-02-03T21:30:36.493252Z",
     "iopub.status.idle": "2022-02-03T21:30:37.976228Z",
     "shell.execute_reply": "2022-02-03T21:30:37.975946Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Lightwood modules\n",
    "import lightwood as lw\n",
    "from lightwood import ProblemDefinition, \\\n",
    "                      JsonAI, \\\n",
    "                      json_ai_from_problem, \\\n",
    "                      code_from_json_ai, \\\n",
    "                      predictor_from_code"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "instant-income",
   "metadata": {},
   "source": [
    "### 1) Load your data\n",
    "\n",
    "Lightwood works with `pandas.DataFrame`s; load data via pandas as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "technical-government",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:37.981255Z",
     "iopub.status.busy": "2022-02-03T21:30:37.980893Z",
     "iopub.status.idle": "2022-02-03T21:30:38.234611Z",
     "shell.execute_reply": "2022-02-03T21:30:38.234810Z"
    }
   },
   "outputs": [],
   "source": [
    "filename = 'https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/used_car_price/data.csv'\n",
    "df = pd.read_csv(filename)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "anonymous-rainbow",
   "metadata": {},
   "source": [
    "We can see a handful of columns above, such as `model, year, price, transmission, mileage, fuelType, tax, mpg, engineSize`. Some columns are numerical whereas others are categorical. We are going to specifically only focus on categorical columns.\n",
    "\n",
    "\n",
    "### 2) Generate JSON-AI Syntax\n",
    "\n",
    "We will make a `LabelEncoder` as follows:\n",
    "\n",
    "(1) Find all unique examples within a column <br>\n",
    "(2) Order the examples in a consistent way <br>\n",
    "(3) Label (python-index of 0 as start) each category <br>\n",
    "(4) Assign the label according to each datapoint. <br>\n",
    "\n",
    "First, let's generate a JSON-AI syntax so we can automatically identify each column. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "absent-maker",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.237433Z",
     "iopub.status.busy": "2022-02-03T21:30:38.237167Z",
     "iopub.status.idle": "2022-02-03T21:30:38.968313Z",
     "shell.execute_reply": "2022-02-03T21:30:38.968531Z"
    }
   },
   "outputs": [],
   "source": [
    "# Create the Problem Definition\n",
    "pdef = ProblemDefinition.from_dict({\n",
    "    'target': 'price', # column you want to predict\n",
    "    #'ignore_features': ['year', 'mileage', 'tax', 'mpg', 'engineSize']\n",
    "})\n",
    "\n",
    "# Generate a JSON-AI object\n",
    "json_ai = json_ai_from_problem(df, problem_definition=pdef)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "swedish-riverside",
   "metadata": {},
   "source": [
    "Let's take a look at our JSON-AI and print to file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "coastal-paragraph",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.972233Z",
     "iopub.status.busy": "2022-02-03T21:30:38.971959Z",
     "iopub.status.idle": "2022-02-03T21:30:38.973971Z",
     "shell.execute_reply": "2022-02-03T21:30:38.973749Z"
    }
   },
   "outputs": [],
   "source": [
    "print(json_ai.to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "expired-flour",
   "metadata": {},
   "source": [
    "### 3) Create your custom encoder (`LabelEncoder`).\n",
    "\n",
    "Once our JSON-AI is filled, let's make our LabelEncoder. All Lightwood encoders inherit from the `BaseEncoder` class, found [here](https://github.com/mindsdb/lightwood/blob/staging/lightwood/encoder/base.py). \n",
    "\n",
    "![BaseEncoder](baseencoder.png)\n",
    "\n",
    "\n",
    "The `BaseEncoder` has 5 expected calls:\n",
    "\n",
    "- `__init__`: instantiate the encoder\n",
    "- `prepare`: Train or create the rules of the encoder\n",
    "- `encode`: Given data, convert to the featurized representation\n",
    "- `decode`: Given featurized representations, revert back to data\n",
    "- `to`: Use CPU/GPU (mostly important for learned representations)\n",
    "\n",
    "From above, we see that \"model\", \"transmission\", and \"fuelType\" are all categorical columns. These will be the ones we want to modify."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "verbal-northwest",
   "metadata": {},
   "source": [
    "##### `LabelEncoder`\n",
    "\n",
    "The `LabelEncoder` should satisfy a couple of rules\n",
    "\n",
    "(1) For the ``__init__`` call: <br>\n",
    "  - Specify the only argument `is_target`; this asks whether the encoder aims to represent the target column.<br>\n",
    "  - Set `is_prepared=False` in the initialization. All encoders are prepared using their `prepare()` call, which turns this flag on to `True` if preparation of the encoders is successful. <br>\n",
    "  - Set `output_size=1`; the output size refers to how many options the represented encoder may adopt. \n",
    "    \n",
    "    \n",
    "(2) For the ``prepare`` call:\n",
    "  - Specify the only argument `priming_data`; this provides the `pd.Series` of the data column for the encoder.\n",
    "  - Find all unique categories in the column data\n",
    "  - Make a dictionary representing label number to category (reserves 0 as Unknown) and the inverse dictionary\n",
    "  - Set `is_prepared=True`\n",
    "  \n",
    "(3) The `encode()` call will convert each data point's category name into the encoded label.\n",
    "\n",
    "(4) The `decode()` call will convert a previously encoded label into the original category name.\n",
    "\n",
    "Given this approach only uses simple dictionaries, there is no need for a dedicated `to()` call (although this would inherit `BaseEncoder`'s implementation).\n",
    "\n",
    "This implementation would look as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e03db1b0",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.976898Z",
     "iopub.status.busy": "2022-02-03T21:30:38.975171Z",
     "iopub.status.idle": "2022-02-03T21:30:38.978711Z",
     "shell.execute_reply": "2022-02-03T21:30:38.978491Z"
    }
   },
   "outputs": [],
   "source": [
    "%%writefile LabelEncoder.py\n",
    "\n",
    "\"\"\"\n",
    "2021.10.13\n",
    "\n",
    "Create a LabelEncoder that transforms categorical data into a label.\n",
    "\"\"\"\n",
    "import pandas as pd\n",
    "import torch\n",
    "\n",
    "from lightwood.encoder import BaseEncoder\n",
    "from typing import List, Union\n",
    "from lightwood.helpers.log import log\n",
    "\n",
    "\n",
    "class LabelEncoder(BaseEncoder):\n",
    "    \"\"\"\n",
    "    Create a label representation for categorical data. The data will rely on sorted to organize the order of the labels.\n",
    "\n",
    "    Class Attributes:\n",
    "    - is_target: Whether this is used to encode the target\n",
    "    - is_prepared: Whether the encoder rules have been set (after ``prepare`` is called)\n",
    "\n",
    "    \"\"\"  # noqa\n",
    "\n",
    "    is_target: bool\n",
    "    is_prepared: bool\n",
    "\n",
    "    is_timeseries_encoder: bool = False\n",
    "    is_trainable_encoder: bool = True\n",
    "\n",
    "    def __init__(self, is_target: bool = False, stop_after = 10) -> None:\n",
    "        \"\"\"\n",
    "        Initialize the Label Encoder\n",
    "\n",
    "        :param is_target:\n",
    "        \"\"\"\n",
    "        self.is_target = is_target\n",
    "        self.is_prepared = False\n",
    "\n",
    "        # Size of the output encoded dimension per data point\n",
    "        # For LabelEncoder, this is always 1 (1 label per category)\n",
    "        self.output_size = 1\n",
    "\n",
    "    def prepare(self, train_data: pd.Series, dev_data: pd.Series) -> None:\n",
    "        \"\"\"\n",
    "        Create a LabelEncoder for categorical data.\n",
    "\n",
    "        LabelDict creates a mapping where each index is associated to a category.\n",
    "\n",
    "        :param priming_data: Input column data that is categorical.\n",
    "\n",
    "        :returns: Nothing; prepares encoder rules with `label_dict` and `ilabel_dict`\n",
    "        \"\"\"\n",
    "\n",
    "        # Find all unique categories in the dataset\n",
    "        categories = train_data.unique()\n",
    "\n",
    "        log.info(\"Categories Detected = \" + str(self.output_size))\n",
    "\n",
    "        # Create the Category labeller\n",
    "        self.label_dict = {\"Unknown\": 0}  # Include an unknown category\n",
    "        self.label_dict.update({cat: idx + 1 for idx, cat in enumerate(categories)})\n",
    "        self.ilabel_dict = {idx: cat for cat, idx in self.label_dict.items()}\n",
    "\n",
    "        self.is_prepared = True\n",
    "\n",
    "    def encode(self, column_data: Union[pd.Series, list]) -> torch.Tensor:\n",
    "        \"\"\"\n",
    "        Convert pre-processed data into the labeled values\n",
    "\n",
    "        :param column_data: Pandas series to convert into labels\n",
    "        \"\"\"\n",
    "        if isinstance(column_data, pd.Series):\n",
    "            enc = column_data.apply(lambda x: self.label_dict.get(x, 0)).tolist()\n",
    "        else:\n",
    "            enc = [self.label_dict.get(x, 0) for x in column_data]\n",
    "\n",
    "        return torch.Tensor(enc).int().unsqueeze(1)\n",
    "\n",
    "    def decode(self, encoded_data: torch.Tensor) -> List[object]:\n",
    "        \"\"\"\n",
    "        Convert torch.Tensor labels into categorical data\n",
    "\n",
    "        :param encoded_data: Encoded data in the form of a torch.Tensor\n",
    "        \"\"\"\n",
    "        return [self.ilabel_dict[i.item()] for i in encoded_data]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "23cce8a8",
   "metadata": {},
   "source": [
    "Some additional notes:\n",
    "(1) The `encode()` call should be able to intake a list of values, it is optional to make it compatible with `pd.Series` or `pd.DataFrame` <br>\n",
    "(2) The output of `encode()` must be a torch tensor with dimensionality $N_{rows} x N_{output}$.\n",
    "\n",
    "Now that the `LabelEncoder` is complete, move this to `~/lightwood_modules` and we're ready to try this out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e30866c1",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.980727Z",
     "iopub.status.busy": "2022-02-03T21:30:38.980489Z",
     "iopub.status.idle": "2022-02-03T21:30:38.982070Z",
     "shell.execute_reply": "2022-02-03T21:30:38.981851Z"
    }
   },
   "outputs": [],
   "source": [
    "from lightwood import load_custom_module\n",
    "load_custom_module('LabelEncoder.py')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "optical-archive",
   "metadata": {},
   "source": [
    "### 4) Edit JSON-AI\n",
    "\n",
    "Now that we have our `LabelEncoder` script, we have two ways of introducing this encoder:\n",
    "\n",
    "(1) Change all categorical columns to our encoder of choice <br>\n",
    "(2) Replace the default encoder (`Categorical.OneHotEncoder`) for categorical data to our encoder of choice <br>\n",
    "\n",
    "In the first scenario, we may not want to change ALL columns. By switching the encoder on a `Feature` level, Lightwood allows you to control how representations for a given feature are handled. However, suppose you want to replace an approach entirely with your own methodology - Lightwood supports overriding default methods to control how you want to treat a *data type* as well.\n",
    "\n",
    "Below, we'll show both strategies:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "quiet-lodging",
   "metadata": {},
   "source": [
    "The first strategy requires just specifying which features you'd like to change. Once you have your list, you can manually set the encoder \"module\" to the class you'd like. **This is best suited for a few columns or if you only want to override a few particular columns as opposed to replacing the `Encoder` behavior for an entire data type**.\n",
    "#### Strategy 1: Change the encoders for the features directly\n",
    "```python\n",
    "for ft in [\"model\", \"transmission\", \"fuelType\"]: # Features you want to replace\n",
    "    # Set each feature to the custom encoder\n",
    "    json_ai.encoders[ft]['module'] = 'LabelEncoder.LabelEncoder'\n",
    "```\n",
    "\n",
    "\n",
    "Suppose you have many columns that are categorical- you may want to enforce your approach explicitly without naming each column. This can be done by examining the `data_dtype` of JSON-AI's features. For all features that are type `categorical` (while this is a `str`, it's ideal to import dtype and explicitly check the data type), replace the default `Encoder` with your encoder. In this case, this is `LabelEncoder.LabelEncoder`.\n",
    "#### Strategy 2: Programatically change *all* encoder assignments for a data type\n",
    "\n",
    "```python\n",
    "from lightwood.api import dtype\n",
    "for i in json_ai.dtype_dict:\n",
    "    if json_ai.dtype_dict[i] == dtype.categorical:\n",
    "        json_ai.encoders[i]['module'] = 'LabelEncoder.LabelEncoder'\n",
    "```\n",
    "\n",
    "We'll go with the first approach for simplicity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "elementary-fusion",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.983970Z",
     "iopub.status.busy": "2022-02-03T21:30:38.983730Z",
     "iopub.status.idle": "2022-02-03T21:30:38.984867Z",
     "shell.execute_reply": "2022-02-03T21:30:38.985070Z"
    }
   },
   "outputs": [],
   "source": [
    "for ft in [\"model\", \"transmission\", \"fuelType\"]: # Features you want to replace\n",
    "    # Set each feature to the custom encoder\n",
    "    json_ai.encoders[ft]['module'] = 'LabelEncoder.LabelEncoder'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "together-austria",
   "metadata": {},
   "source": [
    "### 5) Generate code and your predictor from JSON-AI\n",
    "\n",
    "Now, let's use this JSON-AI object to generate code and make a predictor. This can be done in 2 simple lines, below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "inappropriate-james",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.987455Z",
     "iopub.status.busy": "2022-02-03T21:30:38.986031Z",
     "iopub.status.idle": "2022-02-03T21:30:38.993611Z",
     "shell.execute_reply": "2022-02-03T21:30:38.993823Z"
    }
   },
   "outputs": [],
   "source": [
    "#Generate python code that fills in your pipeline\n",
    "code = code_from_json_ai(json_ai)\n",
    "\n",
    "# Turn the code above into a predictor object\n",
    "predictor = predictor_from_code(code)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "personalized-andorra",
   "metadata": {},
   "source": [
    "Now, let's run our pipeline. To do so, let's first:\n",
    "\n",
    "(1) Perform a statistical analysis on the data (*this is important in preparing Encoders/Mixers as it populates the* `StatisticalAnalysis` *attribute with details some encoders need*). <br>\n",
    "(2) Clean our data <br>\n",
    "(3) Prepare the encoders <br>\n",
    "(4) Featurize the data <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "palestinian-harvey",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:38.996435Z",
     "iopub.status.busy": "2022-02-03T21:30:38.996175Z",
     "iopub.status.idle": "2022-02-03T21:30:39.355344Z",
     "shell.execute_reply": "2022-02-03T21:30:39.355539Z"
    }
   },
   "outputs": [],
   "source": [
    "# Perform Stats Analysis\n",
    "predictor.analyze_data(df)\n",
    "\n",
    "# Pre-process the data\n",
    "cleaned_data = predictor.preprocess(data=df)\n",
    "\n",
    "# Create a train/test split\n",
    "split_data = predictor.split(cleaned_data)\n",
    "\n",
    "# Prepare the encoders \n",
    "predictor.prepare(split_data)\n",
    "\n",
    "# Featurize the data\n",
    "ft_data = predictor.featurize(split_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ordered-beast",
   "metadata": {},
   "source": [
    "The splitter creates 3 data-splits, a \"train\", \"dev\", and \"test\" set. The `featurize` command from the predictor allows us to convert the cleaned data into features. We can access this as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "silent-dealing",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:39.361308Z",
     "iopub.status.busy": "2022-02-03T21:30:39.361026Z",
     "iopub.status.idle": "2022-02-03T21:30:39.392427Z",
     "shell.execute_reply": "2022-02-03T21:30:39.392125Z"
    }
   },
   "outputs": [],
   "source": [
    "# Pick a categorical column name\n",
    "col_name = \"fuelType\"\n",
    "\n",
    "# Get the encoded feature data\n",
    "enc_ft = ft_data[\"train\"].get_encoded_column_data(col_name).squeeze(1) #torch tensor (N_rows x N_output_dim)\n",
    "\n",
    "# Get the original data from the dataset\n",
    "orig_data = ft_data[\"train\"].get_column_original_data(col_name) #pandas dataframe\n",
    "\n",
    "# Create a pandas data frame to compare encoded data and original data\n",
    "compare_data = pd.concat([orig_data, pd.Series(enc_ft, name=\"EncData\")], axis=1)\n",
    "compare_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fatty-peoples",
   "metadata": {},
   "source": [
    "We can see what the label mapping is by inspecting our encoders as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "superior-mobility",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:39.395473Z",
     "iopub.status.busy": "2022-02-03T21:30:39.395212Z",
     "iopub.status.idle": "2022-02-03T21:30:39.396457Z",
     "shell.execute_reply": "2022-02-03T21:30:39.396663Z"
    }
   },
   "outputs": [],
   "source": [
    "# Label Name -> Label Number\n",
    "print(predictor.encoders[col_name].label_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "frequent-remedy",
   "metadata": {},
   "source": [
    "For each category above, the number associated in the dictionary is the label for each category. This means \"Diesel\" is always represented by a 1, etc.\n",
    "\n",
    "With that, you've created your own custom Encoder that uses a rule-based approach! Please checkout more [tutorials](https://lightwood.io/tutorials.html) for other custom approach guides."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: docssrc/source/tutorials/custom_explainer/custom_explainer.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial - Implementing a custom analysis block in Lightwood\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "As you might already know, Lightwood is designed to be a flexible machine learning (ML) library that is able to abstract and automate the entire ML pipeline. Crucially, it is also designed to be extended or modified very easily according to your needs, essentially offering the entire spectrum between fully automated AutoML and a lightweight wrapper for customized ML pipelines.\n",
    "\n",
    "As such, we can identify several different customizable \"phases\" in the process. The relevant phase for this tutorial is the \"analysis\" that comes after a predictor has been trained. The goal of this phase is to generate useful insights, like accuracy metrics, confusion matrices, feature importance, etc. These particular examples are all included in the core analysis procedure that Lightwood executes.\n",
    "\n",
    "However, the analysis procedure is structured into a sequential execution of \"analysis blocks\". Each analysis block should generate a well-defined set of insights, as well as handling any actions regarding these at inference time.\n",
    "\n",
    "As an example, one of the core blocks is the Inductive Conformal Prediction (`ICP`) block, which handles the confidence estimation of all Lightwood predictors. The logic within can be complex at times, but thanks to the block abstraction we can deal with it in a structured manner. As this `ICP` block is used when generating predictions, it implements the two main methods that the `BaseAnalysisBlock` class specifies: `.analyze()` to setup everything that is needed, and `.explain()` to actually estimate the confidence in any given prediction.\n",
    "\n",
    "\n",
    "## Objective\n",
    "\n",
    "In this tutorial, we will go through the steps required to implement your own analysis blocks to customize the insights of any Lightwood predictor!\n",
    "\n",
    "In particular, we will implement a \"model correlation heatmap\" block: we want to compare the predictions of all mixers inside a `BestOf` ensemble object, to understand how they might differ in their overall behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:56.978393Z",
     "iopub.status.busy": "2022-02-03T21:29:56.977362Z",
     "iopub.status.idle": "2022-02-03T21:29:58.457474Z",
     "shell.execute_reply": "2022-02-03T21:29:58.457729Z"
    }
   },
   "outputs": [],
   "source": [
    "from typing import Dict, Tuple\n",
    "import pandas as pd\n",
    "import lightwood\n",
    "lightwood.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: figuring out what we need\n",
    "\n",
    "When designing an analysis block, an important choice needs to be made: will this block operate when calling the predictor? Or is it only going to describe its performance once in the held-out validation dataset?\n",
    "\n",
    "Being in the former case means we need to implement both `.analyze()` and `.explain()` methods, while the latter case only needs an `.analyze()` method. Our `ModelCorrelationHeatmap` belongs to this second category.\n",
    "\n",
    "Let's start the implementation by inheriting from `BaseAnalysisBlock`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:58.461457Z",
     "iopub.status.busy": "2022-02-03T21:29:58.461199Z",
     "iopub.status.idle": "2022-02-03T21:29:58.585428Z",
     "shell.execute_reply": "2022-02-03T21:29:58.585180Z"
    }
   },
   "outputs": [],
   "source": [
    "from lightwood.analysis import BaseAnalysisBlock\n",
    "\n",
    "class ModelCorrelationHeatmap(BaseAnalysisBlock):\n",
    "    def __init__(self, deps=tuple()):\n",
    "        super().__init__(deps=deps)\n",
    "        \n",
    "    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:\n",
    "        return info\n",
    "\n",
    "    def explain(self,\n",
    "                row_insights: pd.DataFrame,\n",
    "                global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:\n",
    "        \n",
    "        return row_insights, global_insights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:58.588212Z",
     "iopub.status.busy": "2022-02-03T21:29:58.587939Z",
     "iopub.status.idle": "2022-02-03T21:29:58.589556Z",
     "shell.execute_reply": "2022-02-03T21:29:58.589754Z"
    }
   },
   "outputs": [],
   "source": [
    "ModelCorrelationHeatmap()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Right now, our newly created analysis block doesn't do much, apart from returning the `info` and insights (`row_insights` and `global_insights`) exactly as it received them from the previous block.\n",
    "\n",
    "As previously discussed, we only need to implement a procedure that runs post-training, no action is required at inference time. This means we can use the default `.explain()` behavior in the parent class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:58.592434Z",
     "iopub.status.busy": "2022-02-03T21:29:58.592181Z",
     "iopub.status.idle": "2022-02-03T21:29:58.593455Z",
     "shell.execute_reply": "2022-02-03T21:29:58.593652Z"
    }
   },
   "outputs": [],
   "source": [
    "class ModelCorrelationHeatmap(BaseAnalysisBlock):\n",
    "    def __init__(self, deps=tuple()):\n",
    "        super().__init__(deps=deps)\n",
    "        \n",
    "    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:\n",
    "        return info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Implementing the custom analysis block\n",
    "\n",
    "Okay, now for the fun bit: we have to implement a correlation heatmap between the predictions of all mixers inside a `BestOf` ensemble. This is currently the only ensemble implemented in Lightwood, but it is a good idea to explicitly check that the type of the ensemble is what we expect.\n",
    "\n",
    "A natural question to ask at this point is: what information do we have to implement the procedure? You'll note that, apart from the `info` dictionary, we receive a `kwargs` dictionary. You can check out the full documentation for more details, but the keys (and respective value types) exposed in this object by default are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:58.596282Z",
     "iopub.status.busy": "2022-02-03T21:29:58.596028Z",
     "iopub.status.idle": "2022-02-03T21:29:58.597193Z",
     "shell.execute_reply": "2022-02-03T21:29:58.597502Z"
    }
   },
   "outputs": [],
   "source": [
    "kwargs = {\n",
    "        'predictor': 'lightwood.ensemble.BaseEnsemble',\n",
    "        'target': 'str',\n",
    "        'input_cols': 'list',\n",
    "        'dtype_dict': 'dict',\n",
    "        'normal_predictions': 'pd.DataFrame',\n",
    "        'data': 'pd.DataFrame',\n",
    "        'train_data': 'lightwood.data.encoded_ds.EncodedDs',\n",
    "        'encoded_val_data': 'lightwood.data.encoded_ds.EncodedDs',\n",
    "        'is_classification': 'bool',\n",
    "        'is_numerical': 'bool',\n",
    "        'is_multi_ts': 'bool',\n",
    "        'stats_info': 'lightwood.api.types.StatisticalAnalysis',\n",
    "        'ts_cfg': 'lightwood.api.types.TimeseriesSettings',\n",
    "        'accuracy_functions': 'list',\n",
    "        'has_pretrained_text_enc': 'bool'\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see there is lots to work with, but for this example we will focus on using:\n",
    "\n",
    "1. The `predictor` ensemble\n",
    "2. The `encoded_val_data` to generate predictions for each mixer inside the ensemble\n",
    "\n",
    "And the insight we want to produce is a matrix that compares the outputs of all mixers and computes the correlation between them.\n",
    "\n",
    "Let's implement the algorithm:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:58.600174Z",
     "iopub.status.busy": "2022-02-03T21:29:58.599887Z",
     "iopub.status.idle": "2022-02-03T21:29:58.601638Z",
     "shell.execute_reply": "2022-02-03T21:29:58.601837Z"
    }
   },
   "outputs": [],
   "source": [
    "%%writefile model_correlation.py\n",
    "\n",
    "from typing import Dict\n",
    "from types import SimpleNamespace\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "from lightwood.ensemble import BestOf\n",
    "from lightwood.analysis import BaseAnalysisBlock\n",
    "\n",
    "\n",
    "class ModelCorrelationHeatmap(BaseAnalysisBlock):\n",
    "    def __init__(self, deps=tuple()):\n",
    "        super().__init__(deps=deps)\n",
    "        \n",
    "    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:\n",
    "        ns = SimpleNamespace(**kwargs)\n",
    "        \n",
    "        # only triggered with the right type of ensemble\n",
    "        if isinstance(ns.predictor, BestOf):\n",
    "            \n",
    "            # store prediction from every mixer\n",
    "            all_predictions = []\n",
    "\n",
    "            for mixer in ns.predictor.mixers:\n",
    "                predictions = mixer(ns.encoded_val_data)['prediction'].values  # retrieve np.ndarray from the returned pd.DataFrame\n",
    "                all_predictions.append(predictions.flatten().astype(int))  # flatten and cast labels to int\n",
    "\n",
    "            # calculate correlation matrix\n",
    "            corrs = np.corrcoef(np.array(all_predictions))\n",
    "            \n",
    "            # save inside `info` object\n",
    "            info['mixer_correlation'] = corrs\n",
    "        \n",
    "        return info\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the use of `SimpleNamespace` for dot notation accessors.\n",
    "\n",
    "The procedure above is fairly straightforward, as we leverage numpy's `corrcoef()` function to generate the matrix. \n",
    "\n",
    "Finally, it is very important to add the output to `info` so that it is saved inside the actual predictor object. "
   ]
  },
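  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick standalone sketch of what the block computes (using made-up prediction vectors, not predictions from this tutorial's dataset): `np.corrcoef` treats each row as one variable, so stacking the per-mixer prediction vectors yields an `n_mixers` by `n_mixers` symmetric matrix with ones on the diagonal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# hypothetical label predictions from two mixers on the same validation rows\n",
    "preds_a = np.array([1, 0, 1, 1, 0])\n",
    "preds_b = np.array([1, 0, 0, 1, 0])\n",
    "\n",
    "# rows are variables, so this yields a (2, 2) symmetric matrix with ones on the diagonal\n",
    "corrs = np.corrcoef(np.array([preds_a, preds_b]))\n",
    "corrs"
   ]
  },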
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Exposing the block to Lightwood\n",
    "\n",
    "\n",
    "To use this in an arbitrary script, we need to add the above class (and all necessary imports) to a `.py` file inside one of the following directories:\n",
    "\n",
    "* `~/lightwood_modules` (where `~` is your home directory, e.g. `/Users/username/` for macOS and `/home/username/` for Linux)\n",
    "* `/etc/lightwood_modules`\n",
    "\n",
    "Lightwood will scan these directories and import any classes it finds, so that they can be used by the `JsonAI` code generating module.\n",
    "\n",
    "**To continue, please save the code cell above as `model_correlation.py` in one of the indicated directories.**"
   ]
  },
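  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are running this notebook locally, a minimal sketch of that copy step (assuming the `%%writefile` cell above left `model_correlation.py` in the current working directory) is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "\n",
    "# copy the module into the user-level scan directory (~/lightwood_modules)\n",
    "modules_dir = os.path.expanduser('~/lightwood_modules')\n",
    "os.makedirs(modules_dir, exist_ok=True)\n",
    "shutil.copy('model_correlation.py', modules_dir)"
   ]
  },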
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Final test run\n",
    "\n",
    "Ok! Everything looks set to try out our custom block. Let's generate a predictor for [this](https://github.com/mindsdb/lightwood/blob/stable/tests/data/hdi.csv) sample dataset, and see whether our new insights are any good.\n",
    "\n",
    "First, it is important to add our `ModelCorrelationHeatmap` to the `analysis_blocks` attribute of the Json AI object that will generate your predictor code. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:58.604529Z",
     "iopub.status.busy": "2022-02-03T21:29:58.604262Z",
     "iopub.status.idle": "2022-02-03T21:29:59.210752Z",
     "shell.execute_reply": "2022-02-03T21:29:59.210964Z"
    }
   },
   "outputs": [],
   "source": [
    "from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, load_custom_module\n",
    "import pandas as pd\n",
    "\n",
    "# First, load the custom module we wrote\n",
    "load_custom_module('model_correlation.py')\n",
    "\n",
    "# read dataset\n",
    "df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/main/tests/data/hdi.csv')\n",
    "\n",
    "# define the predictive task\n",
    "pdef = ProblemDefinition.from_dict({\n",
    "    'target': 'Development Index',         # column you want to predict\n",
    "    'time_aim': 100,\n",
    "})\n",
    "\n",
    "# generate the Json AI intermediate representation from the data and its corresponding settings\n",
    "json_ai = json_ai_from_problem(df, problem_definition=pdef)\n",
    "\n",
    "# add the custom list of analysis blocks; in this case, composed of a single block\n",
    "json_ai.analysis_blocks = [{\n",
    "    'module': 'model_correlation.ModelCorrelationHeatmap',\n",
    "    'args': {}\n",
    "}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can take a look at the respective Json AI key just to confirm our newly added analysis block is in there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:59.213815Z",
     "iopub.status.busy": "2022-02-03T21:29:59.213557Z",
     "iopub.status.idle": "2022-02-03T21:29:59.215126Z",
     "shell.execute_reply": "2022-02-03T21:29:59.214910Z"
    }
   },
   "outputs": [],
   "source": [
    "json_ai.analysis_blocks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to create a predictor from this Json AI, and subsequently train it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:29:59.218326Z",
     "iopub.status.busy": "2022-02-03T21:29:59.218052Z",
     "iopub.status.idle": "2022-02-03T21:30:04.805303Z",
     "shell.execute_reply": "2022-02-03T21:30:04.805568Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from lightwood.api.high_level import code_from_json_ai, predictor_from_code\n",
    "\n",
    "code = code_from_json_ai(json_ai)\n",
    "predictor = predictor_from_code(code)\n",
    "\n",
    "predictor.learn(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can visualize the mixer correlation matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:30:04.822276Z",
     "iopub.status.busy": "2022-02-03T21:30:04.821591Z",
     "iopub.status.idle": "2022-02-03T21:30:04.861450Z",
     "shell.execute_reply": "2022-02-03T21:30:04.861243Z"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "mc = predictor.runtime_analyzer['mixer_correlation']  # newly produced insight\n",
    "\n",
    "mixer_names = [c.__class__.__name__ for c in predictor.ensemble.mixers]\n",
    "\n",
    "# plotting code\n",
    "fig, ax = plt.subplots()\n",
    "im = ax.imshow(mc, cmap='seismic')\n",
    "\n",
    "# set ticks\n",
    "ax.set_xticks(np.arange(mc.shape[0]))\n",
    "ax.set_yticks(np.arange(mc.shape[1]))\n",
    "\n",
    "# set tick labels\n",
    "ax.set_xticklabels(mixer_names)\n",
    "ax.set_yticklabels(mixer_names)\n",
    "\n",
    "# show cell values\n",
    "for i in range(len(mixer_names)):\n",
    "    for j in range(len(mixer_names)):\n",
    "        ax.text(j, i, round(mc[i, j], 3), ha=\"center\", va=\"center\", color=\"w\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nice! We've just added an additional piece of insight regarding the predictor that Lightwood came up with for the task of predicting the Human Development Index of any given country.\n",
    "\n",
    "This matrix tells us how strongly the predictions of each pair of mixers stored in the ensemble correlate with each other.\n",
    "\n",
    "This is, of course, a very simple example, but it shows the convenience of such an abstraction within the broader pipeline that Lightwood automates.\n",
    "\n",
    "For more complex examples, you can check out any of the three core analysis blocks that we use:\n",
    "\n",
    "* `lightwood.analysis.nc.calibrate.ICP`\n",
    "* `lightwood.analysis.helpers.acc_stats.AccStats`\n",
    "* `lightwood.analysis.helpers.feature_importance.PermutationFeatureImportance`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: docssrc/source/tutorials/custom_mixer/custom_mixer.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial - Implementing a custom mixer in Lightwood\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Mixers are the centerpiece of Lightwood, tasked with learning the mapping between the encoded feature and target representations.\n",
    "\n",
    "\n",
    "## Objective\n",
    "\n",
    "In this tutorial we'll implement an sklearn random forest as a mixer that handles categorical and binary targets. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: The Mixer Interface\n",
    "\n",
    "The Mixer interface is defined by the `BaseMixer` class. A mixer needs methods for 4 tasks:\n",
    "* fitting (`fit`)\n",
    "* predicting (`__call__`)\n",
    "* construction (`__init__`)\n",
    "* partial fitting (`partial_fit`), though this one is optional"
   ]
  },
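  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In skeleton form (a sketch of the interface contract only, not a working mixer; the class and method bodies here are placeholders), that looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from lightwood.mixer import BaseMixer\n",
    "from lightwood.api.types import PredictionArguments\n",
    "from lightwood.data.encoded_ds import EncodedDs\n",
    "\n",
    "\n",
    "class SkeletonMixer(BaseMixer):\n",
    "    def __init__(self, stop_after: int):\n",
    "        super().__init__(stop_after)  # construction\n",
    "\n",
    "    def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:\n",
    "        ...  # fitting\n",
    "\n",
    "    def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:\n",
    "        ...  # partial fitting (optional)\n",
    "\n",
    "    def __call__(self, ds: EncodedDs,\n",
    "                 args: PredictionArguments = PredictionArguments()) -> pd.DataFrame:\n",
    "        # predicting: must return a dataframe with a `prediction` column\n",
    "        return pd.DataFrame({'prediction': []})"
   ]
  },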
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Writing our mixer\n",
    "\n",
    "I'm going to create a file called `random_forest_mixer.py` inside `/etc/lightwood_modules`, which is one of the directories Lightwood sources custom modules from.\n",
    "\n",
    "Inside it, I'm going to write the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:37:10.384583Z",
     "iopub.status.busy": "2022-02-03T21:37:10.383368Z",
     "iopub.status.idle": "2022-02-03T21:37:10.389048Z",
     "shell.execute_reply": "2022-02-03T21:37:10.389617Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting random_forest_mixer.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile random_forest_mixer.py\n",
    "\n",
    "from lightwood.mixer import BaseMixer\n",
    "from lightwood.api.types import PredictionArguments\n",
    "from lightwood.data.encoded_ds import EncodedDs, ConcatedEncodedDs\n",
    "from type_infer.dtype import dtype\n",
    "from lightwood.encoder import BaseEncoder\n",
    "\n",
    "import torch\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "\n",
    "class RandomForestMixer(BaseMixer):\n",
    "    clf: RandomForestClassifier\n",
    "\n",
    "    def __init__(self, stop_after: int, dtype_dict: dict, target: str, target_encoder: BaseEncoder):\n",
    "        super().__init__(stop_after)\n",
    "        self.target_encoder = target_encoder\n",
    "        # Raise an error if someone tries to use this for a problem that's not classification; it would fail anyway, but this way the error message is more intuitive\n",
    "        if dtype_dict[target] not in (dtype.categorical, dtype.binary):\n",
    "            raise Exception(f'This mixer can only be used for classification problems! Got target dtype {dtype_dict[target]} instead!')\n",
    "\n",
    "        # We could also initialize this in `fit` if some of the parameters depend on the input data, since `fit` is called exactly once\n",
    "        self.clf = RandomForestClassifier(max_depth=30)\n",
    "\n",
    "    def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:\n",
    "        X, Y = [], []\n",
    "        # By default mixers get some train data and a bit of dev data on which to do early stopping or hyperparameter optimization. For this mixer, we don't need dev data, so we're going to concat the two in order to get more training data. Then, we're going to turn them into an sklearn-friendly format.\n",
    "        for x, y in ConcatedEncodedDs([train_data, dev_data]):\n",
    "            X.append(x.tolist())\n",
    "            Y.append(y.tolist())\n",
    "        self.clf.fit(X, Y)\n",
    "\n",
    "    def __call__(self, ds: EncodedDs,\n",
    "                 args: PredictionArguments = PredictionArguments()) -> pd.DataFrame:\n",
    "        # Turn the data into an sklearn friendly format\n",
    "        X = []\n",
    "        for x, _ in ds:\n",
    "            X.append(x.tolist())\n",
    "\n",
    "        Yh = self.clf.predict(X)\n",
    "\n",
    "        # Lightwood encoders are meant to decode torch tensors, so we have to cast the predictions first\n",
    "        decoded_predictions = self.target_encoder.decode(torch.Tensor(Yh))\n",
    "\n",
    "        # Finally, turn the decoded predictions into a dataframe with a single column called `prediction`. This is the standard behaviour all lightwood mixers use\n",
    "        ydf = pd.DataFrame({'prediction': decoded_predictions})\n",
    "\n",
    "        return ydf\n",
    "\n",
    "    \n",
    "    # We'll skip implementing `partial_fit`, thus making this mixer unsuitable for online training tasks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Using our mixer\n",
    "\n",
    "We're going to use our mixer for diagnosing heart disease using this dataset: [https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv](https://github.com/mindsdb/benchmarks/blob/main/benchmarks/datasets/heart_disease/data.csv)\n",
    "\n",
    "First, since we don't want to bother writing a Json AI for this dataset from scratch, we're going to let lightwood auto generate one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:37:10.396795Z",
     "iopub.status.busy": "2022-02-03T21:37:10.396197Z",
     "iopub.status.idle": "2022-02-03T21:37:12.621715Z",
     "shell.execute_reply": "2022-02-03T21:37:12.621913Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001B[32mINFO:lightwood-1468487:Dropping features: []\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Analyzing a sample of 298\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:from a total population of 303, this is equivalent to 98.3% of your data.\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Using 7 processes to deduct types.\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: cp\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: age\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: sex\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: trestbps\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: chol\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: restecg\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: fbs\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column cp has data type categorical\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column chol has data type integer\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column sex has data type binary\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column restecg has data type categorical\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column trestbps has data type integer\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column age has data type integer\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column fbs has data type binary\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: thalach\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: exang\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: ca\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: slope\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: oldpeak\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column thalach has data type integer\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: thal\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column exang has data type binary\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column ca has data type categorical\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Infering type for: target\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column thal has data type categorical\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column target has data type binary\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column oldpeak has data type float\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Column slope has data type categorical\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Starting statistical analysis\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Finished statistical analysis\u001B[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"encoders\": {\n",
      "        \"target\": {\n",
      "            \"module\": \"BinaryEncoder\",\n",
      "            \"args\": {\n",
      "                \"is_target\": \"True\",\n",
      "                \"target_weights\": \"$statistical_analysis.target_weights\"\n",
      "            }\n",
      "        },\n",
      "        \"age\": {\n",
      "            \"module\": \"NumericEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"sex\": {\n",
      "            \"module\": \"BinaryEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"cp\": {\n",
      "            \"module\": \"OneHotEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"trestbps\": {\n",
      "            \"module\": \"NumericEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"chol\": {\n",
      "            \"module\": \"NumericEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"fbs\": {\n",
      "            \"module\": \"BinaryEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"restecg\": {\n",
      "            \"module\": \"OneHotEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"thalach\": {\n",
      "            \"module\": \"NumericEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"exang\": {\n",
      "            \"module\": \"BinaryEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"oldpeak\": {\n",
      "            \"module\": \"NumericEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"slope\": {\n",
      "            \"module\": \"OneHotEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"ca\": {\n",
      "            \"module\": \"OneHotEncoder\",\n",
      "            \"args\": {}\n",
      "        },\n",
      "        \"thal\": {\n",
      "            \"module\": \"OneHotEncoder\",\n",
      "            \"args\": {}\n",
      "        }\n",
      "    },\n",
      "    \"dtype_dict\": {\n",
      "        \"age\": \"integer\",\n",
      "        \"sex\": \"binary\",\n",
      "        \"cp\": \"categorical\",\n",
      "        \"trestbps\": \"integer\",\n",
      "        \"chol\": \"integer\",\n",
      "        \"fbs\": \"binary\",\n",
      "        \"restecg\": \"categorical\",\n",
      "        \"thalach\": \"integer\",\n",
      "        \"exang\": \"binary\",\n",
      "        \"oldpeak\": \"float\",\n",
      "        \"slope\": \"categorical\",\n",
      "        \"ca\": \"categorical\",\n",
      "        \"thal\": \"categorical\",\n",
      "        \"target\": \"binary\"\n",
      "    },\n",
      "    \"dependency_dict\": {},\n",
      "    \"model\": {\n",
      "        \"module\": \"BestOf\",\n",
      "        \"args\": {\n",
      "            \"submodels\": [\n",
      "                {\n",
      "                    \"module\": \"Neural\",\n",
      "                    \"args\": {\n",
      "                        \"fit_on_dev\": true,\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n",
      "                        \"search_hyperparameters\": true\n",
      "                    }\n",
      "                },\n",
      "                {\n",
      "                    \"module\": \"LightGBM\",\n",
      "                    \"args\": {\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\",\n",
      "                        \"fit_on_dev\": true\n",
      "                    }\n",
      "                },\n",
      "                {\n",
      "                    \"module\": \"Regression\",\n",
      "                    \"args\": {\n",
      "                        \"stop_after\": \"$problem_definition.seconds_per_mixer\"\n",
      "                    }\n",
      "                }\n",
      "            ],\n",
      "            \"args\": \"$pred_args\",\n",
      "            \"accuracy_functions\": \"$accuracy_functions\",\n",
      "            \"ts_analysis\": null\n",
      "        }\n",
      "    },\n",
      "    \"problem_definition\": {\n",
      "        \"target\": \"target\",\n",
      "        \"pct_invalid\": 2,\n",
      "        \"unbias_target\": true,\n",
      "        \"seconds_per_mixer\": 57024.0,\n",
      "        \"seconds_per_encoder\": null,\n",
      "        \"expected_additional_time\": 0.1534867286682129,\n",
      "        \"time_aim\": 259200,\n",
      "        \"target_weights\": null,\n",
      "        \"positive_domain\": false,\n",
      "        \"timeseries_settings\": {\n",
      "            \"is_timeseries\": false,\n",
      "            \"order_by\": null,\n",
      "            \"window\": null,\n",
      "            \"group_by\": null,\n",
      "            \"use_previous_target\": true,\n",
      "            \"horizon\": null,\n",
      "            \"historical_columns\": null,\n",
      "            \"target_type\": \"\",\n",
      "            \"allow_incomplete_history\": true,\n",
      "            \"eval_cold_start\": true,\n",
      "            \"interval_periods\": []\n",
      "        },\n",
      "        \"anomaly_detection\": false,\n",
      "        \"use_default_analysis\": true,\n",
      "        \"ignore_features\": [],\n",
      "        \"fit_on_all\": true,\n",
      "        \"strict_mode\": true,\n",
      "        \"seed_nr\": 420\n",
      "    },\n",
      "    \"identifiers\": {},\n",
      "    \"accuracy_functions\": [\n",
      "        \"balanced_accuracy_score\"\n",
      "    ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem, load_custom_module\n",
    "import pandas as pd\n",
    "\n",
    "# load the code\n",
    "load_custom_module('random_forest_mixer.py')\n",
    "\n",
    "# read dataset\n",
    "df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/heart_disease/data.csv')\n",
    "\n",
    "# define the predictive task\n",
    "pdef = ProblemDefinition.from_dict({\n",
    "    'target': 'target', # column you want to predict\n",
    "})\n",
    "\n",
    "# generate the Json AI intermediate representation from the data and its corresponding settings\n",
    "json_ai = json_ai_from_problem(df, problem_definition=pdef)\n",
    "\n",
    "# Print it (you can also put it in a file and edit it there)\n",
    "print(json_ai.to_json())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have to edit the `submodels` key of this Json AI to tell Lightwood to use our custom mixer. We can use it together with the others and have it ensembled with them at the end, or standalone. In this case, I'm going to replace all existing mixers with this one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:37:12.626264Z",
     "iopub.status.busy": "2022-02-03T21:37:12.625986Z",
     "iopub.status.idle": "2022-02-03T21:37:12.626938Z",
     "shell.execute_reply": "2022-02-03T21:37:12.627155Z"
    }
   },
   "outputs": [],
   "source": [
    "json_ai.model['args']['submodels'] = [{\n",
    "    'module': 'random_forest_mixer.RandomForestMixer',\n",
    "    'args': {\n",
    "        'stop_after': '$problem_definition.seconds_per_mixer',\n",
    "        'dtype_dict': '$dtype_dict',\n",
    "        'target': '$target',\n",
    "        'target_encoder': '$encoders[self.target]'\n",
    "    }\n",
    "}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we'll generate some code, and finally turn that code into a predictor object and fit it on the original data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:37:12.630176Z",
     "iopub.status.busy": "2022-02-03T21:37:12.629896Z",
     "iopub.status.idle": "2022-02-03T21:37:12.635767Z",
     "shell.execute_reply": "2022-02-03T21:37:12.636014Z"
    }
   },
   "outputs": [],
   "source": [
    "from lightwood.api.high_level import code_from_json_ai, predictor_from_code\n",
    "\n",
    "code = code_from_json_ai(json_ai)\n",
    "predictor = predictor_from_code(code)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:37:12.638323Z",
     "iopub.status.busy": "2022-02-03T21:37:12.638054Z",
     "iopub.status.idle": "2022-02-03T21:37:14.262880Z",
     "shell.execute_reply": "2022-02-03T21:37:14.263096Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001B[32mINFO:lightwood-1468487:Dropping features: []\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Performing statistical analysis on data\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Starting statistical analysis\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Finished statistical analysis\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `analyze_data` runtime: 0.01 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Cleaning the data\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `preprocess` runtime: 0.01 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Splitting the data into train/test\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `split` runtime: 0.01 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Preparing the encoders\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 1\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 2\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 3\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 4\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 5\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 6\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 7\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 8\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 9\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 10\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 11\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 12\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 13\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoder prepping dict length of: 14\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoding UNKNOWN categories as index 0\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoding UNKNOWN categories as index 0\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoding UNKNOWN categories as index 0\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoding UNKNOWN categories as index 0\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Encoding UNKNOWN categories as index 0\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: target\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: age\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: sex\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: cp\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: trestbps\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: chol\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: fbs\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: restecg\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: thalach\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: exang\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: oldpeak\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: slope\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: ca\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Done running for: thal\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `prepare` runtime: 0.17 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Featurizing the data\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `featurize` runtime: 0.0 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Training the mixers\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Ensembling the mixer\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Mixer: RandomForestMixer got accuracy: 0.8348214285714286\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Picked best mixer: RandomForestMixer\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `fit` runtime: 1.39 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Analyzing the ensemble of mixers\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:The block ICP is now running its analyze() method\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:The block AccStats is now running its analyze() method\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:The block ConfStats is now running its analyze() method\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `analyze_ensemble` runtime: 0.02 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Adjustment on validation requested.\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Updating the mixers\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `adjust` runtime: 0.0 seconds\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `learn` runtime: 1.62 seconds\u001B[0m\n"
     ]
    }
   ],
   "source": [
    "predictor.learn(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can use the trained predictor to make predictions, or save it to a pickle file for later use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:37:14.266136Z",
     "iopub.status.busy": "2022-02-03T21:37:14.265844Z",
     "iopub.status.idle": "2022-02-03T21:37:14.331096Z",
     "shell.execute_reply": "2022-02-03T21:37:14.330857Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001B[32mINFO:lightwood-1468487:Dropping features: []\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Cleaning the data\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `preprocess` runtime: 0.0 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:Featurizing the data\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `featurize` runtime: 0.0 seconds\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:The block ICP is now running its explain() method\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:The block AccStats is now running its explain() method\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:AccStats.explain() has not been implemented, no modifications will be done to the data insights.\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:The block ConfStats is now running its explain() method\u001B[0m\n",
      "\u001B[32mINFO:lightwood-1468487:ConfStats.explain() has not been implemented, no modifications will be done to the data insights.\u001B[0m\n",
      "\u001B[37mDEBUG:lightwood-1468487: `predict` runtime: 0.03 seconds\u001B[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   original_index prediction  confidence\n",
      "0               0          1    0.972203\n",
      "1               1          0    0.989355\n",
      "2               2          1    0.849581\n"
     ]
    }
   ],
   "source": [
    "predictions = predictor.predict(pd.DataFrame({\n",
    "    'age': [63, 15, None],\n",
    "    'sex': [1, 1, 0],\n",
    "    'thal': [3, 1, 1]\n",
    "}))\n",
    "print(predictions)\n",
    "\n",
    "predictor.save('my_custom_heart_disease_predictor.pickle')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! That is all it takes to solve a predictive problem with Lightwood using your own custom mixer."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

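The custom_splitter tutorial in the next file motivates splitting data into train/dev/test subsets while preserving class balance. As a stdlib-only sketch of that idea (the `stratified_split` helper and its signature are hypothetical illustrations, not Lightwood's actual `lightwood.data.splitter` API), each class can be shuffled and sliced proportionally so every subset keeps the overall class mix:

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, pcts=(0.8, 0.1, 0.1), seed=0):
    """Shuffle each class separately, then slice it by the requested
    percentages so every subset preserves the overall class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    splits = ([], [], [])  # train, dev, test
    for members in by_class.values():
        rng.shuffle(members)
        n = len(members)
        cut1 = int(n * pcts[0])
        cut2 = cut1 + int(n * pcts[1])
        chunks = (members[:cut1], members[cut1:cut2], members[cut2:])
        for split, chunk in zip(splits, chunks):
            split.extend(chunk)
    return splits
```

For example, a 100-row dataset with a 50/50 class mix yields 80/10/10 subsets that each keep the 50/50 balance, whereas a plain random shuffle only preserves that balance in expectation.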
================================================
FILE: docssrc/source/tutorials/custom_splitter/custom_splitter.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "israeli-spyware",
   "metadata": {},
   "source": [
    "## Build your own training/testing split\n",
    "\n",
    "#### Date: 2021.10.07\n",
    "\n",
    "When working with machine learning data, splitting into \"train\", \"dev\" (or validation), and \"test\" sets is important. Models use **train** data to learn representations and update their parameters; **dev** (validation) data is reserved to estimate how the model performs on unseen examples. While the model is not explicitly trained on it, it can be used as a stopping criterion, for hyper-parameter tuning, or as a simple sanity check. Lastly, **test** data is always reserved, hidden from the model, as a final pass to determine which model performs best.\n",
    "\n",
    "Lightwood supports a variety of **encoders** (feature engineering procedures) and **mixers** (predictive algorithms that map feature vectors to the target). Given this diversity of algorithms, it is appropriate to split data into these three categories when *preparing* encoders or *fitting* mixers.\n",
    "\n",
    "Our default approach stratifies labeled data so that all classes are proportionally represented across your train, validation, and test sets. However, in many instances you may want a custom technique to build your own splits. We've included the `splitter` functionality (the default is found in `lightwood.data.splitter`) to enable you to build your own.\n",
    "\n",
    "In the following problem, we shall work with a Kaggle dataset on credit card fraud (found [here](https://www.kaggle.com/mlg-ulb/creditcardfraud)). Fraud detection is difficult because the events we are interested in catching are, thankfully, rare. Because of that, there is a large **imbalance of classes** (in fact, in this dataset, less than 1% of the rows represent the rare event).\n",
    "\n",
    "In a supervised technique, we may want to ensure our training data sees the rare event of interest; a random shuffle could easily under-represent it. We will implement **SMOTE** to increase the number of positive-class examples in our training data.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "interim-discussion",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:35:52.595995Z",
     "iopub.status.busy": "2022-02-03T21:35:52.595085Z",
     "iopub.status.idle": "2022-02-03T21:35:54.062852Z",
     "shell.execute_reply": "2022-02-03T21:35:54.062541Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import torch\n",
    "import nltk\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Lightwood modules\n",
    "import lightwood as lw\n",
    "from lightwood import ProblemDefinition, \\\n",
    "                      JsonAI, \\\n",
    "                      json_ai_from_problem, \\\n",
    "                      code_from_json_ai, \\\n",
    "                      predictor_from_code\n",
    "\n",
    "import imblearn  # version 0.5.0 minimum requirement"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "decimal-techno",
   "metadata": {},
   "source": [
    "### 1) Load your data\n",
    "\n",
    "Lightwood works with `pandas` DataFrames, so we can use pandas to load our data. The cell below reads a hosted copy of the dataset; alternatively, download it from the Kaggle link above and load the CSV locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "foreign-orchestra",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:35:54.067873Z",
     "iopub.status.busy": "2022-02-03T21:35:54.067601Z",
     "iopub.status.idle": "2022-02-03T21:36:14.674561Z",
     "shell.execute_reply": "2022-02-03T21:36:14.674349Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Time</th>\n",
       "      <th>V1</th>\n",
       "      <th>V2</th>\n",
       "      <th>V3</th>\n",
       "      <th>V4</th>\n",
       "      <th>V5</th>\n",
       "      <th>V6</th>\n",
       "      <th>V7</th>\n",
       "      <th>V8</th>\n",
       "      <th>V9</th>\n",
       "      <th>...</th>\n",
       "      <th>V21</th>\n",
       "      <th>V22</th>\n",
       "      <th>V23</th>\n",
       "      <th>V24</th>\n",
       "      <th>V25</th>\n",
       "      <th>V26</th>\n",
       "      <th>V27</th>\n",
       "      <th>V28</th>\n",
       "      <th>Amount</th>\n",
       "      <th>Class</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>-1.359807</td>\n",
       "      <td>-0.072781</td>\n",
       "      <td>2.536347</td>\n",
       "      <td>1.378155</td>\n",
       "      <td>-0.338321</td>\n",
       "      <td>0.462388</td>\n",
       "      <td>0.239599</td>\n",
       "      <td>0.098698</td>\n",
       "      <td>0.363787</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.018307</td>\n",
       "      <td>0.277838</td>\n",
       "      <td>-0.110474</td>\n",
       "      <td>0.066928</td>\n",
       "      <td>0.128539</td>\n",
       "      <td>-0.189115</td>\n",
       "      <td>0.133558</td>\n",
       "      <td>-0.021053</td>\n",
       "      <td>149.62</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.191857</td>\n",
       "      <td>0.266151</td>\n",
       "      <td>0.166480</td>\n",
       "      <td>0.448154</td>\n",
       "      <td>0.060018</td>\n",
       "      <td>-0.082361</td>\n",
       "      <td>-0.078803</td>\n",
       "      <td>0.085102</td>\n",
       "      <td>-0.255425</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.225775</td>\n",
       "      <td>-0.638672</td>\n",
       "      <td>0.101288</td>\n",
       "      <td>-0.339846</td>\n",
       "      <td>0.167170</td>\n",
       "      <td>0.125895</td>\n",
       "      <td>-0.008983</td>\n",
       "      <td>0.014724</td>\n",
       "      <td>2.69</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-1.358354</td>\n",
       "      <td>-1.340163</td>\n",
       "      <td>1.773209</td>\n",
       "      <td>0.379780</td>\n",
       "      <td>-0.503198</td>\n",
       "      <td>1.800499</td>\n",
       "      <td>0.791461</td>\n",
       "      <td>0.247676</td>\n",
       "      <td>-1.514654</td>\n",
       "      <td>...</td>\n",
       "      <td>0.247998</td>\n",
       "      <td>0.771679</td>\n",
       "      <td>0.909412</td>\n",
       "      <td>-0.689281</td>\n",
       "      <td>-0.327642</td>\n",
       "      <td>-0.139097</td>\n",
       "      <td>-0.055353</td>\n",
       "      <td>-0.059752</td>\n",
       "      <td>378.66</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.966272</td>\n",
       "      <td>-0.185226</td>\n",
       "      <td>1.792993</td>\n",
       "      <td>-0.863291</td>\n",
       "      <td>-0.010309</td>\n",
       "      <td>1.247203</td>\n",
       "      <td>0.237609</td>\n",
       "      <td>0.377436</td>\n",
       "      <td>-1.387024</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.108300</td>\n",
       "      <td>0.005274</td>\n",
       "      <td>-0.190321</td>\n",
       "      <td>-1.175575</td>\n",
       "      <td>0.647376</td>\n",
       "      <td>-0.221929</td>\n",
       "      <td>0.062723</td>\n",
       "      <td>0.061458</td>\n",
       "      <td>123.50</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>-1.158233</td>\n",
       "      <td>0.877737</td>\n",
       "      <td>1.548718</td>\n",
       "      <td>0.403034</td>\n",
       "      <td>-0.407193</td>\n",
       "      <td>0.095921</td>\n",
       "      <td>0.592941</td>\n",
       "      <td>-0.270533</td>\n",
       "      <td>0.817739</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.009431</td>\n",
       "      <td>0.798278</td>\n",
       "      <td>-0.137458</td>\n",
       "      <td>0.141267</td>\n",
       "      <td>-0.206010</td>\n",
       "      <td>0.502292</td>\n",
       "      <td>0.219422</td>\n",
       "      <td>0.215153</td>\n",
       "      <td>69.99</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 31 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Time        V1        V2        V3        V4        V5        V6        V7  \\\n",
       "0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   \n",
       "1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   \n",
       "2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   \n",
       "3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   \n",
       "4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   \n",
       "\n",
       "         V8        V9  ...       V21       V22       V23       V24       V25  \\\n",
       "0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   \n",
       "1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   \n",
       "2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   \n",
       "3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   \n",
       "4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   \n",
       "\n",
       "        V26       V27       V28  Amount  Class  \n",
       "0 -0.189115  0.133558 -0.021053  149.62      0  \n",
       "1  0.125895 -0.008983  0.014724    2.69      0  \n",
       "2 -0.139097 -0.055353 -0.059752  378.66      0  \n",
       "3 -0.221929  0.062723  0.061458  123.50      0  \n",
       "4  0.502292  0.219422  0.215153   69.99      0  \n",
       "\n",
       "[5 rows x 31 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the data\n",
    "data = pd.read_csv(\"https://mindsdb-example-data.s3.eu-west-2.amazonaws.com/jupyter/creditcard.csv.zip\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rental-contribution",
   "metadata": {},
   "source": [
    "We see **31 columns**, most of which appear numerical. For confidentiality reasons, the Kaggle dataset explains that the columns labeled $V_i$ are principal components (PCs) from a PCA of the credit card company's original data. There are also \"Time\" and \"Amount\", the two original features that remain: \"Time\" measures the time elapsed since the first transaction in the dataset, and \"Amount\" is the monetary value of the transaction.\n",
    "\n",
    "You can also see a heavy imbalance in the two classes below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cathedral-mills",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-02-03T21:36:14.679574Z",
     "iopub.status.busy": "2022-02-03T21:36:14.679297Z",
     "iopub.status.idle": "2022-02-03T21:36:15.080644Z",
     "shell.execute_reply": "2022-02-03T21:36:15.080366Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Text(0.5, 1.0, 'Distribution of Classes')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAm
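The custom_splitter tutorial above proposes SMOTE to oversample the rare fraud class. As a stdlib-only illustration of the core idea (a simplified sketch with a hypothetical `smote_sketch` helper, not the `imblearn` implementation the tutorial imports), SMOTE synthesizes new minority samples by interpolating between a minority point and one of its nearest minority-class neighbors:

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points: pick a minority sample, find its
    k nearest minority neighbors, and interpolate toward one of them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority-class neighbors of a (excluding a itself)
        neighbors = sorted(
            (b for b in minority if b != a),
            key=lambda b: math.dist(a, b),
        )[:k]
        b = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic
```

Each synthetic point lies on the segment between two real minority samples, which densifies the minority region of feature space rather than merely duplicating rows.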
SYMBOL INDEX (861 symbols across 139 files)

FILE: lightwood/analysis/analyze.py
  function model_analyzer (line 15) | def model_analyzer(

FILE: lightwood/analysis/base.py
  class BaseAnalysisBlock (line 7) | class BaseAnalysisBlock:
    method __init__ (line 9) | def __init__(self,
    method analyze (line 15) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method explain (line 30) | def explain(self,

FILE: lightwood/analysis/explain.py
  function explain (line 14) | def explain(data: pd.DataFrame,

FILE: lightwood/analysis/helpers/acc_stats.py
  class AccStats (line 13) | class AccStats(BaseAnalysisBlock):
    method __init__ (line 16) | def __init__(self, deps=('ICP',)):
    method analyze (line 20) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method fit (line 33) | def fit(self, ns: SimpleNamespace, conf=Optional[np.ndarray]):
    method get_accuracy_stats (line 78) | def get_accuracy_stats(self, is_classification=None, is_numerical=None):
  function get_value_bucket (line 143) | def get_value_bucket(value, buckets, target_dtype):
  function closest (line 164) | def closest(arr, value):

FILE: lightwood/analysis/helpers/conf_stats.py
  class ConfStats (line 10) | class ConfStats(BaseAnalysisBlock):
    method __init__ (line 17) | def __init__(self, deps=('ICP',), ece_bins: int = 10):
    method analyze (line 23) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method _get_stats (line 48) | def _get_stats(self, confs, preds, data, target, task_type='categorica...

FILE: lightwood/analysis/helpers/feature_importance.py
  class PermutationFeatureImportance (line 15) | class PermutationFeatureImportance(BaseAnalysisBlock):
    method __init__ (line 35) | def __init__(self, disable_column_importance=False, row_limit=1000, co...
    method analyze (line 42) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...

FILE: lightwood/analysis/helpers/pyod.py
  class PyOD (line 19) | class PyOD(BaseAnalysisBlock):
    method __init__ (line 27) | def __init__(self, contamination=0.1, deps: Optional[Tuple] = ...):
    method analyze (line 37) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method explain (line 73) | def explain(self,
    method _preprocess_ts_df (line 92) | def _preprocess_ts_df(self, df: pd.DataFrame, ns: SimpleNamespace) -> ...

FILE: lightwood/analysis/helpers/shap.py
  class ShapleyValues (line 17) | class ShapleyValues(BaseAnalysisBlock):
    method __init__ (line 29) | def __init__(self, deps: Optional[Tuple] = ...):
    method analyze (line 35) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method explain (line 71) | def explain(self,

FILE: lightwood/analysis/nc/base.py
  class RegressorMixin (line 11) | class RegressorMixin(object):
    method __init__ (line 12) | def __init__(self) -> None:
    method get_problem_type (line 16) | def get_problem_type(cls):
  class ClassifierMixin (line 20) | class ClassifierMixin(object):
    method __init__ (line 21) | def __init__(self) -> None:
    method get_problem_type (line 25) | def get_problem_type(cls) -> str:
  class TSMixin (line 29) | class TSMixin(object):
    method __init__ (line 30) | def __init__(self) -> None:
    method get_problem_type (line 34) | def get_problem_type(cls):
  class BaseModelAdapter (line 38) | class BaseModelAdapter(BaseEstimator):
    method __init__ (line 41) | def __init__(self, model: object, fit_params: Dict[str, object] = None...
    method fit (line 49) | def fit(self, x: np.array, y: np.array) -> None:
    method predict (line 68) | def predict(self, x: np.array) -> np.array:
    method _underlying_predict (line 94) | def _underlying_predict(self, x: np.array) -> np.array:
  class ClassifierAdapter (line 110) | class ClassifierAdapter(BaseModelAdapter):
    method __init__ (line 111) | def __init__(self, model: object, fit_params: Dict[str, object] = None...
    method _underlying_predict (line 114) | def _underlying_predict(self, x: np.array) -> np.array:
  class RegressorAdapter (line 118) | class RegressorAdapter(BaseModelAdapter):
    method __init__ (line 119) | def __init__(self, model: object, fit_params: Dict[str, object] = None...
    method _underlying_predict (line 122) | def _underlying_predict(self, x: np.array) -> np.array:
  class TSAdapter (line 126) | class TSAdapter(BaseModelAdapter):
    method __init__ (line 127) | def __init__(self, model: object, fit_params: Dict[str, object] = None...
    method _underlying_predict (line 130) | def _underlying_predict(self, x: np.array) -> np.array:
  class CachedRegressorAdapter (line 134) | class CachedRegressorAdapter(RegressorAdapter):
    method __init__ (line 135) | def __init__(self, model, fit_params=None):
    method fit (line 139) | def fit(self, x=None, y=None):
    method predict (line 144) | def predict(self, x=None):
  class CachedClassifierAdapter (line 150) | class CachedClassifierAdapter(ClassifierAdapter):
    method __init__ (line 151) | def __init__(self, model, fit_params=None):
    method fit (line 156) | def fit(self, x=None, y=None):
    method predict (line 161) | def predict(self, x=None):
  class CachedTSAdapter (line 170) | class CachedTSAdapter(TSAdapter):
    method __init__ (line 171) | def __init__(self, model, fit_params=None):
    method fit (line 175) | def fit(self, x=None, y=None):
    method predict (line 178) | def predict(self, x=None):

FILE: lightwood/analysis/nc/calibrate.py
  class ICP (line 25) | class ICP(BaseAnalysisBlock):
    method __init__ (line 28) | def __init__(self,
    method analyze (line 38) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method explain (line 248) | def explain(self, row_insights: pd.DataFrame, global_insights: Dict[st...
    method _formatted (line 477) | def _formatted(row_insights, global_insights, ns, is_numerical):
    method _ts_assign_confs (line 518) | def _ts_assign_confs(result, df, confs, significances, tss) -> pd.Data...

FILE: lightwood/analysis/nc/icp.py
  class BaseIcp (line 17) | class BaseIcp(BaseEstimator):
    method __init__ (line 21) | def __init__(self, nc_function: FunctionType, condition: Union[bool, F...
    method fit (line 45) | def fit(self, x: np.array, y: np.array) -> None:
    method calibrate (line 63) | def calibrate(self, x, y, increment=False):
    method _reduce_scores (line 108) | def _reduce_scores(self):
    method _update_calibration_set (line 111) | def _update_calibration_set(self, x: np.array, y: np.array, increment:...
    method _calibrate_hook (line 118) | def _calibrate_hook(self, x: np.array, y: np.array, increment: bool) -...
  class IcpClassifier (line 125) | class IcpClassifier(BaseIcp, ClassifierMixin):
    method __init__ (line 164) | def __init__(self, nc_function: FunctionType, condition: Union[bool, F...
    method _calibrate_hook (line 170) | def _calibrate_hook(self, x: np.array, y: np.array, increment: bool = ...
    method _update_classes (line 173) | def _update_classes(self, y: np.array, increment: bool) -> None:
    method predict (line 179) | def predict(self, x: np.array, significance: Optional[float] = None) -...
    method predict_conf (line 229) | def predict_conf(self, x):
  class IcpRegressor (line 259) | class IcpRegressor(BaseIcp, RegressorMixin):
    method __init__ (line 295) | def __init__(self, nc_function: FunctionType, condition: bool = None, ...
    method predict (line 298) | def predict(self, x: np.array, significance: bool = None) -> np.array:
  class IcpTSRegressor (line 352) | class IcpTSRegressor(BaseIcp, TSMixin):
    method __init__ (line 374) | def __init__(self, nc_function: FunctionType, horizon_length, conditio...
    method calibrate (line 378) | def calibrate(self, x, y, increment=False):
    method predict (line 389) | def predict(self, x: np.array, significance: bool = None) -> np.array:
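
The ICP classes listed above implement inductive conformal prediction. As a rough, self-contained sketch of the regression case (illustrative only, not Lightwood's implementation; `icp_interval` is a hypothetical helper):

```python
import math

def icp_interval(cal_preds, cal_truths, test_pred, significance=0.1):
    # Nonconformity score: absolute error on a held-out calibration set
    # (cf. AbsErrorErrFunc in nc.py).
    scores = sorted(abs(p - t) for p, t in zip(cal_preds, cal_truths))
    # (1 - significance) quantile with the standard +1 conformal correction.
    k = math.ceil((1 - significance) * (len(scores) + 1)) - 1
    q = scores[min(k, len(scores) - 1)]
    # Under exchangeability, the true value falls inside this interval
    # with probability >= 1 - significance.
    return test_pred - q, test_pred + q
```

The same recipe generalizes to classification (via class p-values) and to multi-step horizons, which is what `IcpClassifier` and `IcpTSRegressor` handle.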

FILE: lightwood/analysis/nc/metrics.py
  function reg_n_correct (line 9) | def reg_n_correct(prediction, y, significance=None):
  function reg_mean_errors (line 24) | def reg_mean_errors(prediction, y, significance):
  function class_n_correct (line 30) | def class_n_correct(prediction, y, significance):
  function class_mean_errors (line 42) | def class_mean_errors(prediction, y, significance=None):
  function class_one_err (line 48) | def class_one_err(prediction, y, significance=None):
  function class_mean_errors_one_class (line 64) | def class_mean_errors_one_class(prediction, y, significance, c=0):
  function class_one_err_one_class (line 80) | def class_one_err_one_class(prediction, y, significance, c=0):
  function _reg_interval_size (line 101) | def _reg_interval_size(prediction, y, significance):
  function reg_min_size (line 108) | def reg_min_size(prediction, y, significance):
  function reg_q1_size (line 112) | def reg_q1_size(prediction, y, significance):
  function reg_median_size (line 116) | def reg_median_size(prediction, y, significance):
  function reg_q3_size (line 120) | def reg_q3_size(prediction, y, significance):
  function reg_max_size (line 124) | def reg_max_size(prediction, y, significance):
  function reg_mean_size (line 128) | def reg_mean_size(prediction, y, significance):
  function class_avg_c (line 135) | def class_avg_c(prediction, y, significance):
  function class_mean_p_val (line 143) | def class_mean_p_val(prediction, y, significance):
  function class_one_c (line 150) | def class_one_c(prediction, y, significance):
  function class_empty (line 159) | def class_empty(prediction, y, significance):
  function n_test (line 168) | def n_test(prediction, y, significance):

FILE: lightwood/analysis/nc/nc.py
  class ClassificationErrFunc (line 17) | class ClassificationErrFunc(object):
    method __init__ (line 23) | def __init__(self):
    method apply (line 27) | def apply(self, prediction, y):
  class RegressionErrFunc (line 46) | class RegressionErrFunc(object):
    method __init__ (line 52) | def __init__(self):
    method apply (line 56) | def apply(self, prediction, y):  # , norm=None, beta=0):
    method apply_inverse (line 75) | def apply_inverse(self, nc, significance):  # , norm=None, beta=0):
  class TSErrFunc (line 95) | class TSErrFunc(object):
    method __init__ (line 101) | def __init__(self):
    method apply (line 105) | def apply(self, prediction, y):
    method apply_inverse (line 124) | def apply_inverse(self, nc, significance):  # , norm=None, beta=0):
  class InverseProbabilityErrFunc (line 144) | class InverseProbabilityErrFunc(ClassificationErrFunc):
    method __init__ (line 153) | def __init__(self):
    method apply (line 156) | def apply(self, prediction, y):
  class MarginErrFunc (line 166) | class MarginErrFunc(ClassificationErrFunc):
    method __init__ (line 176) | def __init__(self):
    method apply (line 179) | def apply(self, prediction, y):
  class AbsErrorErrFunc (line 191) | class AbsErrorErrFunc(RegressionErrFunc):
    method __init__ (line 200) | def __init__(self):
    method apply (line 203) | def apply(self, prediction, y):
    method apply_inverse (line 206) | def apply_inverse(self, nc, significance):
  class BoostedAbsErrorErrFunc (line 214) | class BoostedAbsErrorErrFunc(RegressionErrFunc):
    method __init__ (line 219) | def __init__(self):
    method apply (line 222) | def apply(self, prediction, y):
    method apply_inverse (line 225) | def apply_inverse(self, nc, significance):
  class SignErrorErrFunc (line 236) | class SignErrorErrFunc(RegressionErrFunc):
    method __init__ (line 251) | def __init__(self):
    method apply (line 254) | def apply(self, prediction, y):
    method apply_inverse (line 257) | def apply_inverse(self, nc, significance):
  class TSAbsErrorErrFunc (line 267) | class TSAbsErrorErrFunc(TSErrFunc):
    method __init__ (line 279) | def __init__(self, horizon_length):
    method apply (line 283) | def apply(self, prediction, y):
    method apply_inverse (line 287) | def apply_inverse(self, nc, significance):
  class BaseScorer (line 298) | class BaseScorer(sklearn.base.BaseEstimator):
    method __init__ (line 301) | def __init__(self):
    method fit (line 305) | def fit(self, x, y):
    method score (line 309) | def score(self, x, y=None):
  class RegressorNormalizer (line 313) | class RegressorNormalizer(BaseScorer):
    method __init__ (line 314) | def __init__(self, base_model, normalizer_model, err_func):
    method fit (line 320) | def fit(self, x, y):
    method score (line 327) | def score(self, x, y=None):
  class BaseModelNc (line 332) | class BaseModelNc(BaseScorer):
    method __init__ (line 353) | def __init__(self, model, err_func, normalizer=None, beta=0):
    method fit (line 371) | def fit(self, x, y):
    method score (line 391) | def score(self, x, y=None):
    method __deepcopy__ (line 419) | def __deepcopy__(self, memo={}):
  class ClassifierNc (line 434) | class ClassifierNc(BaseModelNc):
    method __init__ (line 464) | def __init__(self,
  class RegressorNc (line 478) | class RegressorNc(BaseModelNc):
    method __init__ (line 506) | def __init__(self,
    method predict (line 516) | def predict(self, x, nc, significance=None):
  class TSNc (line 582) | class TSNc(BaseModelNc):
    method __init__ (line 610) | def __init__(self,
    method predict (line 620) | def predict(self, x, nc, significance=None):

FILE: lightwood/analysis/nc/norm.py
  class Normalizer (line 15) | class Normalizer(BaseMixer):
    method __init__ (line 29) | def __init__(self, fit_params: dict):
    method fit (line 45) | def fit(self, data: EncodedDs) -> None:
    method __call__ (line 57) | def __call__(self, ds: Union[ConcatedEncodedDs, EncodedDs, np.ndarray]...
    method score (line 71) | def score(self, data) -> np.ndarray:
    method get_labels (line 81) | def get_labels(self, preds: pd.DataFrame, truths: np.ndarray, target_e...
    method compute_numerical_labels (line 119) | def compute_numerical_labels(preds: np.ndarray, truths: np.ndarray, bo...
    method compute_categorical_labels (line 126) | def compute_categorical_labels(preds: np.ndarray, truths: np.ndarray) ...

FILE: lightwood/analysis/nc/util.py
  function t_softmax (line 11) | def t_softmax(x, t=1.0, axis=1):
  function clean_df (line 16) | def clean_df(df, namespace, label_encoders):
  function set_conf_range (line 40) | def set_conf_range(
  function get_numeric_conf_range (line 106) | def get_numeric_conf_range(
  function get_ts_conf_range (line 164) | def get_ts_conf_range(
  function get_categorical_conf (line 189) | def get_categorical_conf(raw_confs: np.ndarray):
  function get_anomalies (line 207) | def get_anomalies(insights: pd.DataFrame, observed_series: Union[pd.Seri...

FILE: lightwood/analysis/nn_conf/temp_scale.py
  class TempScaler (line 14) | class TempScaler(BaseAnalysisBlock):
    method __init__ (line 19) | def __init__(self, deps=tuple()):
    method temperature_scale (line 26) | def temperature_scale(self, logits):
    method softmax (line 30) | def softmax(self, logits):
    method analyze (line 33) | def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, obje...
    method explain (line 85) | def explain(self,

FILE: lightwood/api/high_level.py
  function load_custom_module (line 18) | def load_custom_module(file_path: str):
  function predictor_from_problem (line 30) | def predictor_from_problem(df: pd.DataFrame, problem_definition: Union[P...
  function json_ai_from_problem (line 50) | def json_ai_from_problem(df: pd.DataFrame, problem_definition: Union[Pro...
  function code_from_json_ai (line 88) | def code_from_json_ai(json_ai: JsonAI) -> str:
  function predictor_from_code (line 99) | def predictor_from_code(code: str) -> PredictorInterface:
  function code_from_problem (line 108) | def code_from_problem(df: pd.DataFrame, problem_definition: Union[Proble...
  function predictor_from_state (line 127) | def predictor_from_state(state_file: str, code: str = None) -> Predictor...
  function predictor_from_json_ai (line 157) | def predictor_from_json_ai(json_ai: JsonAI) -> PredictorInterface:

FILE: lightwood/api/json_ai.py
  function lookup_encoder (line 18) | def lookup_encoder(
  function generate_json_ai (line 143) | def generate_json_ai(
  function _merge_implicit_values (line 397) | def _merge_implicit_values(field: dict, implicit_value: dict) -> dict:
  function _populate_implicit_field (line 426) | def _populate_implicit_field(
  function add_implicit_values (line 470) | def add_implicit_values(json_ai: JsonAI) -> JsonAI:
  function validate_json_ai (line 725) | def validate_json_ai(json_ai: JsonAI) -> bool:

FILE: lightwood/api/predictor.py
  class PredictorInterface (line 9) | class PredictorInterface:
    method __init__ (line 39) | def __init__(self):
    method analyze_data (line 42) | def analyze_data(self, data: pd.DataFrame) -> None:
    method preprocess (line 50) | def preprocess(self, data: pd.DataFrame) -> pd.DataFrame:
    method split (line 59) | def split(self, data: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    method prepare (line 68) | def prepare(self, data: Dict[str, pd.DataFrame]) -> None:
    method featurize (line 77) | def featurize(self, split_data: Dict[str, pd.DataFrame]):
    method fit (line 87) | def fit(self, enc_data: Dict[str, pd.DataFrame]) -> None:
    method analyze_ensemble (line 95) | def analyze_ensemble(self, enc_data: Dict[str, pd.DataFrame]) -> None:
    method learn (line 103) | def learn(self, data: pd.DataFrame) -> None:
    method adjust (line 115) | def adjust(self, new_data: pd.DataFrame, old_data: Optional[pd.DataFra...
    method predict (line 129) | def predict(self, data: pd.DataFrame, args: Dict[str, object] = {}) ->...
    method test (line 140) | def test(
    method save (line 155) | def save(self, file_path: str) -> None:
    method export (line 164) | def export(self, file_path: str, json_ai_code: str) -> None:

FILE: lightwood/api/types.py
  class Module (line 22) | class Module(TypedDict):
  class TimeseriesSettings (line 34) | class TimeseriesSettings:
    method from_dict (line 74) | def from_dict(obj: Dict):
    method from_json (line 113) | def from_json(data: str):
    method to_dict (line 123) | def to_dict(self, encode_json=False) -> Dict[str, Json]:
    method to_json (line 131) | def to_json(self) -> Dict[str, Json]:
  class ProblemDefinition (line 140) | class ProblemDefinition:
    method from_dict (line 196) | def from_dict(obj: Dict):
    method from_json (line 251) | def from_json(data: str):
    method to_dict (line 261) | def to_dict(self, encode_json=False) -> Dict[str, Json]:
    method to_json (line 269) | def to_json(self) -> Dict[str, Json]:
  class JsonAI (line 279) | class JsonAI:
    method from_dict (line 319) | def from_dict(obj: Dict):
    method from_json (line 360) | def from_json(data: str):
    method to_dict (line 364) | def to_dict(self, encode_json=False) -> Dict[str, Json]:
    method to_json (line 381) | def to_json(self) -> Dict[str, Json]:
  class SubmodelData (line 392) | class SubmodelData:
  class ModelAnalysis (line 400) | class ModelAnalysis:
  class PredictionArguments (line 430) | class PredictionArguments:
    method from_dict (line 462) | def from_dict(obj: Dict):
    method to_dict (line 498) | def to_dict(self, encode_json=False) -> Dict[str, Json]:

FILE: lightwood/data/encoded_ds.py
  class EncodedDs (line 10) | class EncodedDs(Dataset):
    method __init__ (line 11) | def __init__(self, encoders: Dict[str, BaseEncoder], data_frame: pd.Da...
    method __len__ (line 41) | def __len__(self):
    method __getitem__ (line 49) | def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
    method _encode_idxs (line 71) | def _encode_idxs(self, idxs: list):
    method get_column_original_data (line 114) | def get_column_original_data(self, column_name: str) -> pd.Series:
    method get_encoded_column_data (line 123) | def get_encoded_column_data(self, column_name: str) -> torch.Tensor:
    method get_encoded_data (line 158) | def get_encoded_data(self, include_target: bool = True) -> torch.Tensor:
    method build_cache (line 172) | def build_cache(self):
    method clear_cache (line 183) | def clear_cache(self):
  class ConcatedEncodedDs (line 190) | class ConcatedEncodedDs(EncodedDs):
    method __init__ (line 196) | def __init__(self, encoded_ds_arr: List[EncodedDs]) -> None:
    method __len__ (line 205) | def __len__(self):
    method __getitem__ (line 212) | def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
    method get_column_original_data (line 223) | def get_column_original_data(self, column_name: str) -> pd.Series:
    method get_encoded_column_data (line 230) | def get_encoded_column_data(self, column_name: str) -> torch.Tensor:
    method clear_cache (line 237) | def clear_cache(self):

FILE: lightwood/data/timeseries_analyzer.py
  function timeseries_analyzer (line 12) | def timeseries_analyzer(data: Dict[str, pd.DataFrame], dtype_dict: Dict[...
  function get_naive_residuals (line 57) | def get_naive_residuals(target_data: pd.DataFrame, m: int = 1) -> Tuple[...
  function get_grouped_naive_residuals (line 79) | def get_grouped_naive_residuals(
  function get_differencers (line 98) | def get_differencers(data: pd.DataFrame, target: str, group_cols: List):
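
`get_naive_residuals` measures how well a naive (persistence) forecast explains the series; a hedged stdlib sketch of the core idea (the real function may scale or aggregate differently):

```python
def naive_residuals(series, m=1):
    # Residuals of the m-step naive forecast y_hat[t] = y[t - m]; small
    # residuals mean simple persistence already explains the series well.
    return [abs(series[i] - series[i - m]) for i in range(m, len(series))]
```

`get_grouped_naive_residuals` applies the same computation per group-by partition.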

FILE: lightwood/data/timeseries_transform.py
  function transform_timeseries (line 14) | def transform_timeseries(
  function _ts_infer_next_row (line 194) | def _ts_infer_next_row(df: pd.DataFrame, ob: str) -> pd.DataFrame:
  function _ts_to_obj (line 221) | def _ts_to_obj(df: pd.DataFrame, historical_columns: list) -> pd.DataFrame:
  function _ts_add_previous_rows (line 235) | def _ts_add_previous_rows(df: pd.DataFrame, order_cols: list, window: in...
  function _ts_add_previous_target (line 257) | def _ts_add_previous_target(df: pd.DataFrame, target: str, window: int) ...
  function _ts_add_future_target (line 284) | def _ts_add_future_target(df, target, horizon, data_dtype, mode):
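
`_ts_add_previous_target` builds the autoregressive window each row carries; a simplified sketch (None-padding is an assumption, the real code's padding strategy may differ):

```python
def add_previous_target(values, window):
    # For each row, gather the `window` previous target values,
    # left-padding when the history is shorter than the window.
    out = []
    for i in range(len(values)):
        prev = list(values[max(0, i - window):i])
        out.append([None] * (window - len(prev)) + prev)
    return out
```

`_ts_add_future_target` is the mirror image: it attaches the next `horizon` values as the label.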

FILE: lightwood/encoder/array/array.py
  class ArrayEncoder (line 11) | class ArrayEncoder(BaseEncoder):
    method __init__ (line 24) | def __init__(self, stop_after: float, window: int = None, is_target: b...
    method _pad_and_strip (line 40) | def _pad_and_strip(self, array: List[object]):
    method prepare (line 47) | def prepare(self, train_priming_data: Iterable[Iterable], dev_priming_...
    method encode (line 79) | def encode(self, column_data: Iterable[Iterable]) -> torch.Tensor:
    method decode (line 103) | def decode(self, data: torch.Tensor) -> List[Iterable]:
  class CatArrayEncoder (line 114) | class CatArrayEncoder(ArrayEncoder):
    method __init__ (line 115) | def __init__(self, stop_after: float, window: int = None, is_target: b...
    method prepare (line 118) | def prepare(self, train_priming_data: Iterable[Iterable], dev_priming_...
    method decode (line 122) | def decode(self, data: torch.Tensor) -> List[Iterable]:
  class NumArrayEncoder (line 128) | class NumArrayEncoder(ArrayEncoder):
    method __init__ (line 129) | def __init__(self, stop_after: float, window: int = None, is_target: b...

FILE: lightwood/encoder/array/ts_cat_array.py
  class TsCatArrayEncoder (line 10) | class TsCatArrayEncoder(BaseEncoder):
    method __init__ (line 11) | def __init__(self, timesteps: int, is_target: bool = False, grouped_by...
    method prepare (line 25) | def prepare(self, priming_data):
    method encode (line 37) | def encode(self, data: Iterable[Iterable], dependency_data: Optional[D...
    method encode_one (line 55) | def encode_one(self, data: Iterable) -> torch.Tensor:
    method decode (line 77) | def decode(self, encoded_values, dependency_data=None) -> List[List]:
    method decode_one (line 99) | def decode_one(self, encoded_value) -> List:

FILE: lightwood/encoder/array/ts_num_array.py
  class TsArrayNumericEncoder (line 9) | class TsArrayNumericEncoder(BaseEncoder):
    method __init__ (line 10) | def __init__(self, timesteps: int, is_target: bool = False, positive_d...
    method prepare (line 29) | def prepare(self, priming_data):
    method encode (line 39) | def encode(self, data: Iterable[Iterable], dependency_data: Optional[D...
    method decode (line 59) | def decode(self, encoded_values, dependency_data=None) -> List[List]:
    method decode_one (line 81) | def decode_one(self, encoded_value, dependency_data={}) -> List:

FILE: lightwood/encoder/audio/mfcc.py
  class MFCCEncoder (line 10) | class MFCCEncoder(BaseEncoder):
    method __init__ (line 13) | def __init__(self, is_target: bool = False):
    method prepare (line 29) | def prepare(self, priming_data: Iterable[str]):
    method encode (line 40) | def encode(self, column_data: Iterable[str]) -> torch.Tensor:
    method decode (line 79) | def decode(self, _):

FILE: lightwood/encoder/base.py
  class BaseEncoder (line 5) | class BaseEncoder:
    method __init__ (line 35) | def __init__(self, is_target=False) -> None:
    method prepare (line 42) | def prepare(self, priming_data: Iterable[object]) -> None:
    method encode (line 50) | def encode(self, column_data: Iterable[object]) -> torch.Tensor:
    method decode (line 62) | def decode(self, encoded_data: torch.Tensor) -> List[object]:
    method to (line 73) | def to(self, device, available_devices):

FILE: lightwood/encoder/categorical/autoencoder.py
  class CategoricalAutoEncoder (line 17) | class CategoricalAutoEncoder(BaseEncoder):
    method __init__ (line 27) | def __init__(
    method prepare (line 64) | def prepare(self, train_priming_data: pd.Series, dev_priming_data: pd....
    method encode (line 115) | def encode(self, column_data: Iterable[str]) -> torch.Tensor:
    method decode (line 132) | def decode(self, encoded_data: torch.Tensor) -> List[str]:
    method _prepare_AE_input (line 148) | def _prepare_AE_input(
    method _prepare_catae (line 183) | def _prepare_catae(self, train_loader: DataLoader, dev_loader: DataLoa...
    method _encoder_targets (line 236) | def _encoder_targets(self, data):
    method _label_targets (line 245) | def _label_targets(self, data):

FILE: lightwood/encoder/categorical/binary.py
  class BinaryEncoder (line 13) | class BinaryEncoder(BaseEncoder):
    method __init__ (line 34) | def __init__(
    method prepare (line 59) | def prepare(self, priming_data: Iterable[str]):
    method encode (line 100) | def encode(self, column_data: Iterable[str]) -> torch.Tensor:
    method decode (line 124) | def decode(self, encoded_data: torch.Tensor):
    method decode_probabilities (line 150) | def decode_probabilities(self, encoded_data: torch.Tensor) -> Tuple[Li...
    method _norm_vec (line 173) | def _norm_vec(vec: List[float]):

FILE: lightwood/encoder/categorical/gym.py
  class Gym (line 10) | class Gym:
    method __init__ (line 12) | def __init__(self, model, optimizer, scheduler, loss_criterion, device,
    method fit (line 28) | def fit(self, train_data_loader, test_data_loader, desired_error, max_...

FILE: lightwood/encoder/categorical/multihot.py
  class MultiHotEncoder (line 7) | class MultiHotEncoder(BaseEncoder):
    method __init__ (line 8) | def __init__(self, is_target: bool = False):
    method _clean_col_data (line 15) | def _clean_col_data(column_data):
    method prepare (line 20) | def prepare(self, priming_data, max_dimensions=100):
    method encode (line 29) | def encode(self, column_data):
    method decode (line 34) | def decode(self, vectors):

FILE: lightwood/encoder/categorical/onehot.py
  class OneHotEncoder (line 13) | class OneHotEncoder(BaseEncoder):
    method __init__ (line 36) | def __init__(
    method prepare (line 59) | def prepare(self, priming_data: Iterable[str]):
    method encode (line 108) | def encode(self, column_data: Iterable[str]) -> torch.Tensor:
    method decode (line 133) | def decode(self, encoded_data: torch.Tensor):
    method decode_probabilities (line 153) | def decode_probabilities(self, encoded_data: torch.Tensor) -> Tuple[Li...
    method _norm_vec (line 176) | def _norm_vec(vec: List[float]):
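
`OneHotEncoder` maps categories to basis vectors, typically reserving a slot for unseen values; a minimal sketch of the pattern (the trailing unknown slot is an assumption about the general technique, not the exact layout used here):

```python
class TinyOneHot:
    def prepare(self, priming):
        # Stable category -> index map; the last slot catches unseen values.
        self.index = {c: i for i, c in enumerate(sorted(set(priming)))}
        self.width = len(self.index) + 1

    def encode(self, data):
        rows = []
        for value in data:
            vec = [0.0] * self.width
            vec[self.index.get(value, self.width - 1)] = 1.0
            rows.append(vec)
        return rows

    def decode(self, rows):
        rev = {i: c for c, i in self.index.items()}
        return [rev.get(max(range(self.width), key=r.__getitem__), None)
                for r in rows]
```

`decode_probabilities` extends decoding by also returning the per-class scores instead of just the argmax.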

FILE: lightwood/encoder/categorical/simple_label.py
  class SimpleLabelEncoder (line 11) | class SimpleLabelEncoder(BaseEncoder):
    method __init__ (line 20) | def __init__(self, is_target=False, normalize=True) -> None:
    method prepare (line 28) | def prepare(self, priming_data: Union[list, pd.Series]) -> None:
    method encode (line 40) | def encode(self, data: Union[tuple, np.ndarray, pd.Series], normalize=...
    method decode (line 59) | def decode(self, encoded_values: torch.Tensor, normalize=True) -> List...

FILE: lightwood/encoder/datetime/datetime.py
  class DatetimeEncoder (line 11) | class DatetimeEncoder(BaseEncoder):
    method __init__ (line 17) | def __init__(self, is_target: bool = False):
    method prepare (line 29) | def prepare(self, priming_data):
    method encode (line 32) | def encode(self, data: Union[np.ndarray, pd.Series]) -> torch.Tensor:
    method decode (line 55) | def decode(self, encoded_data: torch.Tensor, return_as_datetime=False)...

FILE: lightwood/encoder/datetime/datetime_sin_normalizer.py
  class DatetimeNormalizerEncoder (line 11) | class DatetimeNormalizerEncoder(BaseEncoder):
    method __init__ (line 12) | def __init__(self, is_target: bool = False, sinusoidal: bool = False):
    method prepare (line 23) | def prepare(self, priming_data):
    method encode (line 29) | def encode(self, data):
    method encode_one (line 46) | def encode_one(self, data):
    method decode (line 75) | def decode(self, encoded_data, return_as_datetime=False):
    method decode_one (line 85) | def decode_one(self, vector, return_as_datetime=False):
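
With `sinusoidal=True`, `DatetimeNormalizerEncoder` maps cyclic calendar fields onto the unit circle so that, e.g., December and January end up close together; a sketch of the per-field transform:

```python
import math

def cyclic_encode(value, period):
    # Project a cyclic quantity (month, weekday, hour, ...) onto the
    # unit circle so distances wrap around the period boundary.
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)
```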

FILE: lightwood/encoder/helpers.py
  class MinMaxNormalizer (line 6) | class MinMaxNormalizer:
    method __init__ (line 7) | def __init__(self, combination=()):
    method prepare (line 13) | def prepare(self, x: np.ndarray) -> None:
    method encode (line 27) | def encode(self, y: np.ndarray) -> torch.Tensor:
    method decode (line 40) | def decode(self, y):
  class CatNormalizer (line 44) | class CatNormalizer:
    method __init__ (line 45) | def __init__(self, encoder_class='one_hot'):
    method prepare (line 54) | def prepare(self, x):
    method encode (line 62) | def encode(self, Y):
    method decode (line 75) | def decode(self, y):
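
`MinMaxNormalizer` rescales values into [0, 1]; a stdlib sketch of the encode/decode round trip (the real class operates on numpy arrays and torch tensors, per group-by combination):

```python
class TinyMinMax:
    def prepare(self, xs):
        self.lo, self.hi = min(xs), max(xs)

    def encode(self, xs):
        span = (self.hi - self.lo) or 1.0  # guard against constant series
        return [(x - self.lo) / span for x in xs]

    def decode(self, ys):
        span = (self.hi - self.lo) or 1.0
        return [y * span + self.lo for y in ys]
```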

FILE: lightwood/encoder/identity/identity.py
  class IdentityEncoder (line 7) | class IdentityEncoder(BaseEncoder):
    method __init__ (line 23) | def __init__(self, is_target: bool = False, handle_nan: bool = True) -...
    method prepare (line 32) | def prepare(self, priming_data: Iterable[Union[float, int]]) -> None:
    method encode (line 36) | def encode(self, column_data: Iterable[Union[float, int]]) -> torch.Te...
    method decode (line 53) | def decode(self, encoded_data: torch.Tensor) -> List[object]:

FILE: lightwood/encoder/image/helpers/img_to_vec.py
  class ChannelPoolAdaptiveAvg1d (line 14) | class ChannelPoolAdaptiveAvg1d(torch.nn.AdaptiveAvgPool1d):
    method forward (line 18) | def forward(self, input):
  class Img2Vec (line 28) | class Img2Vec(nn.Module):
    method __init__ (line 36) | def __init__(self, device=''):
    method to (line 46) | def to(self, device, available_devices=1):
    method forward (line 51) | def forward(self, image, batch=True):

FILE: lightwood/encoder/image/img_2_vec.py
  class Img2VecEncoder (line 14) | class Img2VecEncoder(BaseEncoder):
    method __init__ (line 27) | def __init__(
    method prepare (line 67) | def prepare(self, train_priming_data: Iterable[str], dev_priming_data:...
    method to (line 81) | def to(self, device, available_devices=1):
    method encode (line 93) | def encode(self, images: List[str]) -> torch.Tensor:
    method decode (line 115) | def decode(self, encoded_values_tensor: torch.Tensor):

FILE: lightwood/encoder/numeric/numeric.py
  class NumericEncoder (line 14) | class NumericEncoder(BaseEncoder):
    method __init__ (line 26) | def __init__(self, data_type: dtype = None,
    method prepare (line 49) | def prepare(self, priming_data: pd.Series):
    method encode (line 61) | def encode(self, data: Union[np.ndarray, pd.Series]):
    method _sign_fn (line 93) | def _sign_fn(x: float) -> float:
    method _log_fn (line 97) | def _log_fn(x: float) -> float:
    method _norm_fn (line 100) | def _norm_fn(self, x: float) -> float:
    method _none_fn (line 104) | def _none_fn(x: float) -> float:
    method decode (line 107) | def decode(self, encoded_values: torch.Tensor, decode_log: bool = None...
    method get_weights (line 162) | def get_weights(self, label_data):
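
The `_sign_fn`/`_log_fn` helpers suggest `NumericEncoder` represents a number via its sign plus a log-compressed magnitude; a hedged round-trip sketch of that representation (the exact component layout is an assumption):

```python
import math

def encode_num(x):
    # Split a number into (sign, log1p of magnitude) so very large
    # values stay in a range a neural net can digest.
    sign = 0.0 if x == 0 else math.copysign(1.0, x)
    return sign, math.log1p(abs(x))

def decode_num(sign, log_mag):
    # Invert the transform: expm1 undoes log1p, sign restores polarity.
    return sign * math.expm1(log_mag)
```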

FILE: lightwood/encoder/numeric/ts_numeric.py
  class TsNumericEncoder (line 10) | class TsNumericEncoder(NumericEncoder):
    method __init__ (line 16) | def __init__(self, is_target: bool = False, positive_domain: bool = Fa...
    method encode (line 24) | def encode(self, data: Union[np.ndarray, pd.Series], dependency_data: ...
    method decode (line 73) | def decode(self, encoded_values: torch.Tensor, decode_log: bool = None...

FILE: lightwood/encoder/text/helpers/pretrained_helpers.py
  class TextEmbed (line 9) | class TextEmbed(torch.utils.data.Dataset):
    method __init__ (line 17) | def __init__(self, encodings, labels):
    method __getitem__ (line 21) | def __getitem__(self, idx):
    method __len__ (line 26) | def __len__(self):

FILE: lightwood/encoder/text/pretrained.py
  class PretrainedLangEncoder (line 28) | class PretrainedLangEncoder(BaseEncoder):
    method __init__ (line 38) | def __init__(
    method prepare (line 95) | def prepare(
    method _tune_model (line 229) | def _tune_model(self, train_dataset, val_dataset, optim, scheduler, n_...
    method _call (line 315) | def _call(self, batch):
    method _train_callback (line 323) | def _train_callback(self, epoch, loss):
    method encode (line 326) | def encode(self, column_data: Iterable[str]) -> torch.Tensor:
    method decode (line 375) | def decode(self, encoded_values_tensor, max_length=100):
    method to (line 381) | def to(self, device, available_devices):

FILE: lightwood/encoder/text/short.py
  class ShortTextEncoder (line 10) | class ShortTextEncoder(BaseEncoder):
    method __init__ (line 13) | def __init__(self, is_target=False, mode=None, device=''):
    method _unexpected_mode (line 45) | def _unexpected_mode(self):
    method _combine_concat (line 49) | def _combine_concat(self, vecs):
    method _combine_mean (line 52) | def _combine_mean(self, vecs):
    method prepare (line 55) | def prepare(self, priming_data):
    method encode (line 79) | def encode(self, column_data: List[str]) -> torch.Tensor:
    method decode (line 90) | def decode(self, vectors):

FILE: lightwood/encoder/text/tfidf.py
  class TfidfEncoder (line 8) | class TfidfEncoder(BaseEncoder):
    method __init__ (line 9) | def __init__(self, is_target: bool = False):
    method prepare (line 14) | def prepare(self, priming_data, training_data=None):
    method encode (line 18) | def encode(self, column_data):
    method decode (line 23) | def decode(self, encoded_values_tensor):

FILE: lightwood/encoder/text/vocab.py
  class VocabularyEncoder (line 7) | class VocabularyEncoder(BaseEncoder):
    method __init__ (line 8) | def __init__(self, is_target: bool = False):
    method prepare (line 16) | def prepare(self, priming_data):
    method encode (line 22) | def encode(self, column_data):
    method decode (line 30) | def decode(self, encoded_values_tensor):

FILE: lightwood/encoder/time_series/helpers/common.py
  function generate_target_group_normalizers (line 9) | def generate_target_group_normalizers(

FILE: lightwood/encoder/time_series/helpers/rnn_helpers.py
  class DecoderRNNNumerical (line 7) | class DecoderRNNNumerical(nn.Module):
    method __init__ (line 8) | def __init__(self, hidden_size, output_size):
    method forward (line 16) | def forward(self, input, hidden):
    method init_hidden (line 24) | def init_hidden(self, device, batch_size=1):
    method decode (line 27) | def decode(self, data, initial_tensor, criterion, device, hidden_state...
  class EncoderRNNNumerical (line 57) | class EncoderRNNNumerical(nn.Module):
    method __init__ (line 58) | def __init__(self, input_size, hidden_size):
    method forward (line 65) | def forward(self, input, hidden):
    method init_hidden (line 72) | def init_hidden(self, device, batch_size=1):
    method bptt (line 75) | def bptt(self, data, criterion, device):

FILE: lightwood/encoder/time_series/helpers/transformer_helpers.py
  function len_to_mask (line 7) | def len_to_mask(lengths, zeros):
  function get_chunk (line 22) | def get_chunk(source, source_lengths, start, step):
  class PositionalEncoding (line 36) | class PositionalEncoding(nn.Module):
    method __init__ (line 37) | def __init__(self, d_model, dropout=0.2, max_len=5000):
    method forward (line 54) | def forward(self, x):
  class TransformerEncoder (line 60) | class TransformerEncoder(nn.Module):
    method __init__ (line 61) | def __init__(self, ninp, nhead, nhid, nlayers, dropout=0.2):
    method _generate_square_subsequent_mask (line 71) | def _generate_square_subsequent_mask(self, sz):
    method init_weights (line 81) | def init_weights(self):
    method forward (line 86) | def forward(self, src, lengths, device):
    method bptt (line 103) | def bptt(self, batch, criterion, device):
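
`len_to_mask` turns per-sequence lengths into a padding mask for the transformer; a boolean sketch (the original returns tensors, and its `zeros` flag presumably controls mask polarity):

```python
def len_to_mask(lengths, max_len=None):
    # True where a timestep holds real data, False where it is padding.
    max_len = max(lengths) if max_len is None else max_len
    return [[step < length for step in range(max_len)] for length in lengths]
```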

FILE: lightwood/encoder/time_series/rnn.py
  class TimeSeriesEncoder (line 23) | class TimeSeriesEncoder(BaseEncoder):
    method __init__ (line 32) | def __init__(self,
    method setup_nn (line 68) | def setup_nn(self, ts_analysis, dependencies=None):
    method to (line 135) | def to(self, device, available_devices):
    method _prepare_raw_data (line 141) | def _prepare_raw_data(self, data):
    method _get_batch (line 156) | def _get_batch(self, source, start, end):
    method prepare (line 160) | def prepare(self, train_priming_data: pd.Series, dev_priming_data: pd....
    method _encode_one (line 288) | def _encode_one(self, data, previous=None, initial_hidden=None, return...
    method encode (line 346) | def encode(self, column_data, dependency_data=None, get_next_count=None):
    method _decode_one (line 446) | def _decode_one(self, hidden, steps):
    method decode (line 463) | def decode(self, encoded_data, steps=None):
    method _masked_criterion (line 489) | def _masked_criterion(self, output, targets, lengths):

FILE: lightwood/encoder/time_series/ts.py
  class TimeSeriesEncoder (line 9) | class TimeSeriesEncoder(ArrayEncoder):
    method __init__ (line 13) | def __init__(self, stop_after: float, window: int = None, is_target: b...
    method encode (line 24) | def encode(self, column_data: Iterable[Iterable]) -> torch.Tensor:
    method decode (line 46) | def decode(self, data: torch.Tensor) -> List[Iterable]:

FILE: lightwood/ensemble/base.py
  class BaseEnsemble (line 10) | class BaseEnsemble:
    method __init__ (line 33) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, f...
    method __call__ (line 40) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...

FILE: lightwood/ensemble/best_of.py
  class BestOf (line 15) | class BestOf(BaseEnsemble):
    method __init__ (line 22) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, a...
    method __call__ (line 58) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...

FILE: lightwood/ensemble/embed.py
  class Embedder (line 10) | class Embedder(BaseEnsemble):
    method __init__ (line 15) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs) -...
    method __call__ (line 20) | def __call__(self, ds: EncodedDs, args: PredictionArguments = None) ->...

FILE: lightwood/ensemble/identity.py
  class IdentityEnsemble (line 10) | class IdentityEnsemble(BaseEnsemble):
    method __init__ (line 17) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, a...
    method __call__ (line 24) | def __call__(self, ds: EncodedDs, args: PredictionArguments = None) ->...
    method active_mixer (line 30) | def active_mixer(self):
    method active_mixer (line 34) | def active_mixer(self, idx):

FILE: lightwood/ensemble/mean_ensemble.py
  class MeanEnsemble (line 12) | class MeanEnsemble(BaseEnsemble):
    method __init__ (line 18) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, d...
    method __call__ (line 25) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...

FILE: lightwood/ensemble/mode_ensemble.py
  class ModeEnsemble (line 16) | class ModeEnsemble(BaseEnsemble):
    method __init__ (line 26) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, d...
    method _pick_mode_highest_score (line 56) | def _pick_mode_highest_score(self, prediction: pd.Series):
    method __call__ (line 83) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...

FILE: lightwood/ensemble/stacked_ensemble.py
  class StackedEnsemble (line 17) | class StackedEnsemble(MeanEnsemble):
    method __init__ (line 30) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, d...
    method predict (line 57) | def predict(self, ds: EncodedDs, args: PredictionArguments) -> List:
    method __call__ (line 65) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...
    method set_weights (line 74) | def set_weights(self, weights: List):

FILE: lightwood/ensemble/ts_stacked_ensemble.py
  class TsStackedEnsemble (line 19) | class TsStackedEnsemble(StackedEnsemble):
    method __init__ (line 23) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, d...
    method __call__ (line 59) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...

FILE: lightwood/ensemble/weighted_mean_ensemble.py
  class WeightedMeanEnsemble (line 16) | class WeightedMeanEnsemble(BaseEnsemble):
    method __init__ (line 26) | def __init__(self, target, mixers: List[BaseMixer], data: EncodedDs, a...
    method __call__ (line 58) | def __call__(self, ds: EncodedDs, args: PredictionArguments) -> pd.Dat...
    method accuracies_to_weights (line 69) | def accuracies_to_weights(x: np.array) -> np.array:
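
One plausible reading of `accuracies_to_weights` above is a softmax over per-mixer validation accuracies, so better mixers get proportionally larger, normalized weights. Whether lightwood uses a softmax specifically is an assumption (the index only shows the name); the sketch below just shows the normalization idea:

```python
import numpy as np

# Turn per-mixer accuracy scores into ensemble weights that sum to 1.
def accuracies_to_weights(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

w = accuracies_to_weights(np.array([0.9, 0.8, 0.5]))
print(w.sum())  # weights sum to 1.0, ordered like the accuracies
```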

FILE: lightwood/helpers/codegen.py
  function code_from_json_ai (line 22) | def code_from_json_ai(json_ai: JsonAI) -> str:
  function _module_from_code (line 673) | def _module_from_code(code: str, module_name: str) -> ModuleType:
  function _predictor_from_code (line 698) | def _predictor_from_code(code: str):

FILE: lightwood/helpers/device.py
  function is_cuda_compatible (line 7) | def is_cuda_compatible():
  function get_devices (line 43) | def get_devices():
  function get_device_from_name (line 80) | def get_device_from_name(device_name=''):

FILE: lightwood/helpers/general.py
  function is_none (line 7) | def is_none(value):

FILE: lightwood/helpers/io.py
  function read_from_path_or_url (line 6) | def read_from_path_or_url(path: str, load_from_path):

FILE: lightwood/helpers/log.py
  function initialize_log (line 9) | def initialize_log():
  function timed_predictor (line 22) | def timed_predictor(f):
  function timed (line 39) | def timed(f):

FILE: lightwood/helpers/numeric.py
  function filter_nan_and_none (line 5) | def filter_nan_and_none(series: Iterable) -> list:

FILE: lightwood/helpers/parallelism.py
  function get_nr_procs (line 11) | def get_nr_procs(df=None):
  function run_mut_method (line 31) | def run_mut_method(obj: object, arg: object, method: str, identifier: st...
  function mut_method_call (line 40) | def mut_method_call(object_dict: Dict[str, tuple]) -> Dict[str, object]:
  function parallel_encoding_check (line 62) | def parallel_encoding_check(df, encoders):

FILE: lightwood/helpers/seed.py
  function seed (line 6) | def seed(seed_nr: int) -> None:

FILE: lightwood/helpers/templating.py
  function is_allowed (line 11) | def is_allowed(v):
  function call (line 34) | def call(entity: dict) -> str:
  function inline_dict (line 55) | def inline_dict(obj: dict) -> str:
  function align (line 67) | def align(code: str, indent: int) -> str:
  function _consolidate_analysis_blocks (line 77) | def _consolidate_analysis_blocks(jsonai, key):
  function _add_cls_kwarg (line 151) | def _add_cls_kwarg(cls: Callable, kwargs: dict, key: str, value):
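
The templating helpers above (`call`, `inline_dict`, `align`) support lightwood's JsonAI-to-code generation: a `{module, args}` spec is rendered as constructor source text. A rough illustration of that idea (the real helpers handle more cases and a different args encoding; this is just the gist):

```python
# Render a kwargs dict as source text, then a constructor call from a spec.
def inline_dict(obj: dict) -> str:
    return ', '.join(f'{k}={v!r}' for k, v in obj.items())

def call(entity: dict) -> str:
    return f"{entity['module']}({inline_dict(entity.get('args', {}))})"

src = call({'module': 'Neural', 'args': {'stop_after': 10, 'search_hyperparameters': True}})
print(src)  # -> Neural(stop_after=10, search_hyperparameters=True)
```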

FILE: lightwood/helpers/torch.py
  function concat_vectors_and_pad (line 7) | def concat_vectors_and_pad(vec_list, max_):
  function average_vectors (line 47) | def average_vectors(vec_list):
  class LightwoodAutocast (line 52) | class LightwoodAutocast:
    method __init__ (line 72) | def __init__(self, enabled=True):
    method __enter__ (line 97) | def __enter__(self):
    method __exit__ (line 106) | def __exit__(self, *args):
    method __call__ (line 117) | def __call__(self, func):

FILE: lightwood/helpers/ts.py
  function get_ts_groups (line 8) | def get_ts_groups(df: pd.DataFrame, tss) -> list:
  function get_delta (line 17) | def get_delta(df: pd.DataFrame, tss) -> Tuple[Dict, Dict, Dict]:
  function get_inferred_timestamps (line 51) | def get_inferred_timestamps(df: pd.DataFrame, col: str, deltas: dict, ts...
  function add_tn_num_conf_bounds (line 92) | def add_tn_num_conf_bounds(data: pd.DataFrame, tss_args):
  function add_tn_cat_conf_bounds (line 117) | def add_tn_cat_conf_bounds(data: pd.DataFrame, tss_args):
  class Differencer (line 124) | class Differencer:
    method __init__ (line 125) | def __init__(self):
    method diff (line 131) | def diff(self, series: np.array) -> pd.Series:
    method fit (line 136) | def fit(self, series: np.array) -> None:
    method transform (line 143) | def transform(self, series: np.array) -> pd.Series:
    method inverse_transform (line 147) | def inverse_transform(self, series: pd.Series, init=None) -> pd.Series:
    method _flatten_series (line 154) | def _flatten_series(series: np.ndarray) -> np.ndarray:
  function detect_freq_period (line 162) | def detect_freq_period(deltas: pd.DataFrame, tss, n_points) -> tuple:
  function freq_to_pandas (line 218) | def freq_to_pandas(freq, multiplier=1):
  function filter_ts (line 243) | def filter_ts(df: pd.DataFrame, tss, n_rows=1):
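
The `Differencer` listed above exposes a fit/transform/inverse_transform round trip for time series. A standalone sketch of that contract, where `fit` stores the initial value, `transform` takes first differences, and `inverse_transform` cumulatively sums them back (method names mirror the listing; the internals here are assumed, not copied):

```python
import numpy as np
import pandas as pd

class SimpleDifferencer:
    def fit(self, series: np.ndarray) -> None:
        self.first = series[0]  # anchor needed to invert the differencing

    def transform(self, series: np.ndarray) -> pd.Series:
        # prepend the first value so the output keeps the input's length,
        # with a leading zero difference
        return pd.Series(np.diff(series, prepend=series[0]))

    def inverse_transform(self, diffed: pd.Series, init=None) -> pd.Series:
        start = self.first if init is None else init
        return diffed.cumsum() + start

d = SimpleDifferencer()
x = np.array([10.0, 12.0, 11.0, 15.0])
d.fit(x)
recon = d.inverse_transform(d.transform(x))
print(recon.tolist())  # -> [10.0, 12.0, 11.0, 15.0]
```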

FILE: lightwood/mixer/arima.py
  class ARIMAMixer (line 6) | class ARIMAMixer(SkTime):
    method __init__ (line 7) | def __init__(

FILE: lightwood/mixer/base.py
  class BaseMixer (line 8) | class BaseMixer:
    method __init__ (line 31) | def __init__(self, stop_after: float):
    method fit (line 39) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method __call__ (line 49) | def __call__(self, ds: EncodedDs,
    method partial_fit (line 60) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, adju...

FILE: lightwood/mixer/ets.py
  class ETSMixer (line 6) | class ETSMixer(SkTime):
    method __init__ (line 7) | def __init__(

FILE: lightwood/mixer/helpers/ar_net.py
  class ArNet (line 7) | class ArNet(DefaultNet):
    method __init__ (line 13) | def __init__(self,
    method to (line 40) | def to(self, device=None, available_devices=None):
    method forward (line 45) | def forward(self, input):

FILE: lightwood/mixer/helpers/default_net.py
  class DefaultNet (line 9) | class DefaultNet(torch.nn.Module):
    method __init__ (line 16) | def __init__(self,
    method to (line 57) | def to(self, device: torch.device) -> torch.nn.Module:
    method forward (line 66) | def forward(self, input):

FILE: lightwood/mixer/helpers/qclassic_net.py
  class QuantumCircuit (line 9) | class QuantumCircuit:
    method __init__ (line 15) | def __init__(self, n_qubits, backend, shots):
    method run (line 32) | def run(self, thetas):
  class HybridSingleFunction (line 51) | class HybridSingleFunction(torch.autograd.Function):
    method forward (line 55) | def forward(ctx, input, quantum_circuit, shift):
    method backward (line 67) | def backward(ctx, grad_output):
  class HybridSingle (line 82) | class HybridSingle(torch.nn.Module):
    method __init__ (line 85) | def __init__(self, backend, shots, shift):
    method forward (line 90) | def forward(self, input):
  class QClassicNet (line 94) | class QClassicNet(DefaultNet):
    method __init__ (line 99) | def __init__(self,
    method to (line 116) | def to(self, device=None, available_devices=None):
    method forward (line 119) | def forward(self, input):

FILE: lightwood/mixer/helpers/ranger.py
  class Ranger (line 6) | class Ranger(Optimizer):
    method __init__ (line 7) | def __init__(
    method __setstate__ (line 46) | def __setstate__(self, state):
    method step (line 49) | def step(self, closure=None):

FILE: lightwood/mixer/helpers/residual_net.py
  class ResidualModule (line 9) | class ResidualModule(nn.Module):
    method __init__ (line 10) | def __init__(
    method forward (line 23) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class ResidualNet (line 36) | class ResidualNet(torch.nn.Module):
    method __init__ (line 37) | def __init__(self,
    method to (line 58) | def to(self, device: torch.device, available_devices: int = 1) -> torc...
    method forward (line 69) | def forward(self, input):

FILE: lightwood/mixer/helpers/transform_corss_entropy_loss.py
  class TransformCrossEntropyLoss (line 6) | class TransformCrossEntropyLoss(torch.nn.Module):
    method __init__ (line 7) | def __init__(self, **kwargs):
    method forward (line 11) | def forward(self, preds, target):

FILE: lightwood/mixer/helpers/ts.py
  function _transform_target (line 10) | def _transform_target(ts_analysis: Dict[str, Dict], df: pd.DataFrame, fr...
  function _inverse_transform_target (line 25) | def _inverse_transform_target(ts_analysis: Dict[str, Dict], predictions:...

FILE: lightwood/mixer/lightgbm.py
  function check_gpu_support (line 22) | def check_gpu_support():
  class LightGBM (line 38) | class LightGBM(BaseMixer):
    method __init__ (line 62) | def __init__(
    method _to_dataset (line 105) | def _to_dataset(self, data: Dict[str, Dict], output_dtype: str):
    method fit (line 156) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 271) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 305) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/lightgbm_array.py
  class LightGBMArray (line 15) | class LightGBMArray(BaseMixer):
    method __init__ (line 24) | def __init__(
    method _fit (line 59) | def _fit(self, train_data: EncodedDs, dev_data: EncodedDs, submodel_me...
    method fit (line 70) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 74) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 78) | def __call__(self, ds: Union[EncodedDs, ConcatedEncodedDs],

FILE: lightwood/mixer/neural.py
  class Neural (line 29) | class Neural(BaseMixer):
    method __init__ (line 37) | def __init__(
    method _final_tuning (line 77) | def _final_tuning(self, data):
    method _select_criterion (line 101) | def _select_criterion(self) -> torch.nn.Module:
    method _select_optimizer (line 115) | def _select_optimizer(self, model, lr) -> Optimizer:
    method _find_lr (line 119) | def _find_lr(self, train_data):
    method _max_fit (line 186) | def _max_fit(self, train_dl, dev_dl, criterion, optimizer, scaler, sto...
    method _error (line 251) | def _error(self, dev_dl, criterion) -> float:
    method _init_net (line 262) | def _init_net(self, ds: EncodedDs):
    method _net_call (line 277) | def _net_call(self, x: torch.Tensor) -> torch.Tensor:
    method _fit (line 282) | def _fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method fit (line 317) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 321) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 340) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/neural_ts.py
  class NeuralTs (line 23) | class NeuralTs(Neural):
    method __init__ (line 24) | def __init__(
    method _select_criterion (line 68) | def _select_criterion(self) -> torch.nn.Module:
    method _fit (line 76) | def _fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method fit (line 119) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method __call__ (line 122) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/nhits.py
  class NHitsMixer (line 17) | class NHitsMixer(BaseMixer):
    method __init__ (line 26) | def __init__(
    method fit (line 101) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 149) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 155) | def __call__(self, ds: Union[EncodedDs, ConcatedEncodedDs],
    method _make_initial_df (line 209) | def _make_initial_df(self, df, mode='inference'):
    method _set_boundary (line 249) | def _set_boundary(df: pd.DataFrame, gby: list) -> Dict[str, object]:

FILE: lightwood/mixer/prophet.py
  class ProphetMixer (line 8) | class ProphetMixer(SkTime):
    method __init__ (line 9) | def __init__(

FILE: lightwood/mixer/qclassic.py
  class QClassic (line 9) | class QClassic(Neural):
    method __init__ (line 11) | def __init__(

FILE: lightwood/mixer/random_forest.py
  class RandomForest (line 22) | class RandomForest(BaseMixer):
    method __init__ (line 30) | def __init__(
    method _multi_logloss (line 72) | def _multi_logloss(self, y_true: np.ndarray, y_pred: np.ndarray, eps: ...
    method fit (line 79) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 180) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 190) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/regression.py
  class Regression (line 15) | class Regression(BaseMixer):
    method __init__ (line 30) | def __init__(self, stop_after: float, target_encoder: BaseEncoder, dty...
    method fit (line 45) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 71) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 80) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/sktime.py
  class SkTime (line 21) | class SkTime(BaseMixer):
    method __init__ (line 29) | def __init__(
    method fit (line 114) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method _fit (line 133) | def _fit(self, data):
    method partial_fit (line 218) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 233) | def __call__(self, ds: Union[EncodedDs, ConcatedEncodedDs],
    method _call_groupmodel (line 279) | def _call_groupmodel(self,
    method _call_default (line 314) | def _call_default(self, ydf, data, idxs):
    method _get_best_model (line 321) | def _get_best_model(self, trial, train_data, test_data):
    method _transform_index_to_datetime (line 343) | def _transform_index_to_datetime(self, series, series_oby, freq):
    method _get_freq (line 349) | def _get_freq(self, delta):

FILE: lightwood/mixer/tabtransformer.py
  class TabTransformerMixer (line 12) | class TabTransformerMixer(Neural):
    method __init__ (line 13) | def __init__(
    method _init_net (line 42) | def _init_net(self, ds: EncodedDs):
    method _net_call (line 59) | def _net_call(self, x: torch.Tensor) -> torch.Tensor:
    method fit (line 63) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:

FILE: lightwood/mixer/unit.py
  class Unit (line 13) | class Unit(BaseMixer):
    method __init__ (line 14) | def __init__(self, stop_after: float, target_encoder: BaseEncoder):
    method fit (line 30) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 33) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 36) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/xgboost.py
  function check_gpu_support (line 22) | def check_gpu_support():
  class XGBoostMixer (line 36) | class XGBoostMixer(BaseMixer):
    method __init__ (line 63) | def __init__(
    method _to_dataset (line 104) | def _to_dataset(self, ds: EncodedDs, output_dtype: str, mode='train'):
    method fit (line 159) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 257) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 262) | def __call__(self, ds: EncodedDs,

FILE: lightwood/mixer/xgboost_array.py
  class XGBoostArrayMixer (line 15) | class XGBoostArrayMixer(BaseMixer):
    method __init__ (line 24) | def __init__(
    method _fit (line 59) | def _fit(self, train_data: EncodedDs, dev_data: EncodedDs, submodel_me...
    method fit (line 70) | def fit(self, train_data: EncodedDs, dev_data: EncodedDs) -> None:
    method partial_fit (line 74) | def partial_fit(self, train_data: EncodedDs, dev_data: EncodedDs, args...
    method __call__ (line 78) | def __call__(self, ds: Union[EncodedDs, ConcatedEncodedDs],

FILE: tests/integration/advanced/test_array.py
  class TestArrayTarget (line 10) | class TestArrayTarget(unittest.TestCase):
    method _test_array (line 11) | def _test_array(self, df):
    method test_0_num_array (line 23) | def test_0_num_array(self):
    method test_1_cat_array (line 32) | def test_1_cat_array(self):

FILE: tests/integration/advanced/test_custom_modules.py
  function create_custom_module (line 12) | def create_custom_module(module_name, module_code):
  class TestBasic (line 20) | class TestBasic(unittest.TestCase):
    method test_0_add_throwing_cleaner (line 21) | def test_0_add_throwing_cleaner(self):
    method test_1_add_analyzer_block (line 57) | def test_1_add_analyzer_block(self):

FILE: tests/integration/advanced/test_text_input.py
  class TestText (line 7) | class TestText(unittest.TestCase):
    method test_0_train_and_predict_bypass (line 8) | def test_0_train_and_predict_bypass(self):
    method test_1_train_and_predict_model (line 21) | def test_1_train_and_predict_model(self):

FILE: tests/integration/advanced/test_timeseries.py
  class TestTimeseries (line 22) | class TestTimeseries(unittest.TestCase):
    method check_ts_prediction_df (line 23) | def check_ts_prediction_df(self, df: pd.DataFrame, horizon: int, order...
    method split_arrivals (line 39) | def split_arrivals(self, data: pd.DataFrame, grouped: bool) -> (pd.Dat...
    method test_0_time_series_grouped_regression (line 57) | def test_0_time_series_grouped_regression(self):
    method test_1_time_series_regression (line 127) | def test_1_time_series_regression(self):
    method test_2_time_series_classification_short_horizon_binary (line 186) | def test_2_time_series_classification_short_horizon_binary(self):
    method test_3_time_series_classification_long_horizon_binary (line 209) | def test_3_time_series_classification_long_horizon_binary(self):
    method test_4_time_series_classification_long_horizon_multiclass (line 233) | def test_4_time_series_classification_long_horizon_multiclass(self):
    method test_5_time_series_arima_mixer (line 261) | def test_5_time_series_arima_mixer(self):
    method test_6_time_series_sktime_mixer (line 353) | def test_6_time_series_sktime_mixer(self):
    method test_61_offset (line 387) | def test_61_offset(self):
    method test_7_irregular_series (line 432) | def test_7_irregular_series(self):
    method test_8_time_series_double_grouped_regression (line 472) | def test_8_time_series_double_grouped_regression(self):
    method test_9_ts_dedupe (line 505) | def test_9_ts_dedupe(self):
    method test_10_ts_stacked_ensemble (line 524) | def test_10_ts_stacked_ensemble(self):
    method test_11_output_date_format (line 553) | def test_11_output_date_format(self):

FILE: tests/integration/basic/test_airline.py
  class TestBasic (line 8) | class TestBasic(unittest.TestCase):
    method test_0_predict_file_flow (line 10) | def test_0_predict_file_flow(self):

FILE: tests/integration/basic/test_categorical.py
  class TestBasic (line 10) | class TestBasic(unittest.TestCase):
    method setup_predictor (line 11) | def setup_predictor(self, df, target):
    method test_0_binary (line 35) | def test_0_binary(self):
    method test_1_categorical (line 44) | def test_1_categorical(self):
    method test_2_binary_no_analysis (line 60) | def test_2_binary_no_analysis(self):

FILE: tests/integration/basic/test_cleaner.py
  class TestCleaner (line 10) | class TestCleaner(unittest.TestCase):
    method test_0_imputers (line 11) | def test_0_imputers(self):

FILE: tests/integration/basic/test_embedding.py
  class TestEmbeddingPredictor (line 8) | class TestEmbeddingPredictor(unittest.TestCase):
    method test_0_embedding_at_inference_time (line 9) | def test_0_embedding_at_inference_time(self):
    method test_1_embedding_only_at_creation (line 19) | def test_1_embedding_only_at_creation(self):

FILE: tests/integration/basic/test_ensembles.py
  class TestBasic (line 8) | class TestBasic(unittest.TestCase):
    method test_0_mean_ensemble (line 9) | def test_0_mean_ensemble(self):
    method test_1_mode_ensemble (line 35) | def test_1_mode_ensemble(self):
    method test_2_weighted_mean_ensemble (line 63) | def test_2_weighted_mean_ensemble(self):

FILE: tests/integration/basic/test_jsonai.py
  class TestJsonAI (line 7) | class TestJsonAI(unittest.TestCase):
    method test_0_hidden_args_analysis (line 8) | def test_0_hidden_args_analysis(self):
    method test_1_incorrect_chain (line 26) | def test_1_incorrect_chain(self):
    method test_2_tempscale_analysis (line 36) | def test_2_tempscale_analysis(self):

FILE: tests/integration/basic/test_model_selection.py
  class TestMixerSelection (line 8) | class TestMixerSelection(unittest.TestCase):
    method get_mixers (line 9) | def get_mixers(self, df: pd.DataFrame, target: str, prob_kwargs: dict ...
    method test_0_regression_task (line 16) | def test_0_regression_task(self):
    method test_1_multiclass_task (line 23) | def test_1_multiclass_task(self):
    method test_2_unit_text_task (line 30) | def test_2_unit_text_task(self):
    method test_3_complex_text_task (line 37) | def test_3_complex_text_task(self):
    method test_4_timeseries_t_plus_1 (line 44) | def test_4_timeseries_t_plus_1(self):
    method test_5_timeseries_t_plus_n (line 60) | def test_5_timeseries_t_plus_n(self):

FILE: tests/integration/basic/test_qclassic.py
  class TestBasic (line 10) | class TestBasic(unittest.TestCase):
    method test_0_predict_file_flow (line 11) | def test_0_predict_file_flow(self):

FILE: tests/integration/basic/test_regression.py
  class TestBasic (line 10) | class TestBasic(unittest.TestCase):
    method test_0_predict_file_flow (line 11) | def test_0_predict_file_flow(self):
    method test_1_stacked_ensemble (line 66) | def test_1_stacked_ensemble(self):

FILE: tests/integration/basic/test_save_and_load.py
  function save (line 9) | def save(predictor, path):
  function train (line 13) | def train(predictor, df):
  function execute_first_bit (line 17) | def execute_first_bit(code, df, path):
  function execute_second_bit (line 22) | def execute_second_bit(code, df, path):
  function execute_third_bit (line 30) | def execute_third_bit(code, df, path):
  class TestBasic (line 39) | class TestBasic(unittest.TestCase):
    method test_0_predict_file_flow (line 40) | def test_0_predict_file_flow(self):

FILE: tests/integration/basic/test_weird_target_dist.py
  class TestBasic (line 7) | class TestBasic(unittest.TestCase):
    method test_0_unknown_categories_in_test (line 8) | def test_0_unknown_categories_in_test(self):

FILE: tests/unit_tests/analysis/test_nc_norm.py
  class TestNcNormalizer (line 6) | class TestNcNormalizer(unittest.TestCase):
    method test_compute_numerical_labels (line 7) | def test_compute_numerical_labels(self):
    method test_compute_categorical_labels (line 18) | def test_compute_categorical_labels(self):

FILE: tests/unit_tests/analysis/test_pyod.py
  class TestPyOD (line 8) | class TestPyOD(unittest.TestCase):
    method test_0_pyod_analysis (line 9) | def test_0_pyod_analysis(self):

FILE: tests/unit_tests/analysis/test_shap.py
  class TestSHAP (line 8) | class TestSHAP(unittest.TestCase):
    method test_0_shap_analysis (line 9) | def test_0_shap_analysis(self):

FILE: tests/unit_tests/data/test_transform_ts.py
  class TestTransformTS (line 9) | class TestTransformTS(unittest.TestCase):
    method test_get_residuals (line 10) | def test_get_residuals(self):

FILE: tests/unit_tests/encoder/audio/test_mfcc.py
  class TestMFCCEncoder (line 9) | class TestMFCCEncoder(unittest.TestCase):
    method test_encode (line 10) | def test_encode(self):

FILE: tests/unit_tests/encoder/categorical/test_autoencoder.py
  class TestAutoencoder (line 16) | class TestAutoencoder(unittest.TestCase):
    method create_test_data (line 18) | def create_test_data(self,
    method test_autoencoder_ohe (line 44) | def test_autoencoder_ohe(self):
    method test_autoencoder_label (line 65) | def test_autoencoder_label(self):
    method check_encoder_on_device (line 84) | def check_encoder_on_device(self, device):
    method test_encoder_on_cpu (line 94) | def test_encoder_on_cpu(self):
    method test_encoder_on_cuda (line 98) | def test_encoder_on_cuda(self):

FILE: tests/unit_tests/encoder/categorical/test_binary.py
  class TestBinary (line 12) | class TestBinary(unittest.TestCase):
    method test_encode_decode_with_binary (line 15) | def test_encode_decode_with_binary(self):
    method test_check_only_binary (line 71) | def test_check_only_binary(self):
    method test_check_probabilities (line 78) | def test_check_probabilities(self):
    method test_target_distro_scaled_to_1 (line 96) | def test_target_distro_scaled_to_1(self):
    method test_distro_nonzeroweights (line 116) | def test_distro_nonzeroweights(self):
    method test_distro_zero (line 135) | def test_distro_zero(self):

FILE: tests/unit_tests/encoder/categorical/test_label.py
  class TestLabel (line 10) | class TestLabel(unittest.TestCase):
    method test_encode_and_decode (line 13) | def test_encode_and_decode(self):

FILE: tests/unit_tests/encoder/categorical/test_multihot.py
  class TestMultiHotEncoder (line 11) | class TestMultiHotEncoder(unittest.TestCase):
    method get_vocab (line 12) | def get_vocab(self):
    method test_multi_encoding (line 15) | def test_multi_encoding(self):
    method test_multi_encoding_empty_row (line 33) | def test_multi_encoding_empty_row(self):
    method test_handle_unseen_none (line 49) | def test_handle_unseen_none(self):

FILE: tests/unit_tests/encoder/categorical/test_onehot.py
  class TestOnehot (line 12) | class TestOnehot(unittest.TestCase):
    method test_encode_and_decode_with_unknown_token (line 20) | def test_encode_and_decode_with_unknown_token(self):
    method test_encode_and_decode_with_return_zeros (line 65) | def test_encode_and_decode_with_return_zeros(self):
    method test_check_probs_with_unknown (line 106) | def test_check_probs_with_unknown(self):
    method test_target_distro_scaled_to_1 (line 120) | def test_target_distro_scaled_to_1(self):
    method test_target_distro_with_unk (line 140) | def test_target_distro_with_unk(self):
    method test_distro_nonzeroweights (line 160) | def test_distro_nonzeroweights(self):
    method test_distro_zero (line 186) | def test_distro_zero(self):

FILE: tests/unit_tests/encoder/date/test_datetime.py
  class TestDatetimeEncoder (line 11) | class TestDatetimeEncoder(unittest.TestCase):
    method _create_timestamp (line 13) | def _create_timestamp():
    method test_raise_encode_type (line 18) | def test_raise_encode_type(self):
    method test_encode (line 24) | def test_encode(self):
    method test_decode (line 33) | def test_decode(self):
    method test_sinusoidal_encoding (line 52) | def test_sinusoidal_encoding(self):
    method test_cap_invalid_dates (line 66) | def test_cap_invalid_dates(self):

FILE: tests/unit_tests/encoder/identity/test_identity.py
  class TestIdentityEncoder (line 7) | class TestIdentityEncoder(unittest.TestCase):
    method test_encode_and_decode (line 8) | def test_encode_and_decode(self):

FILE: tests/unit_tests/encoder/images/test_img_2_vec.py
  class TestImg2VecEncoder (line 9) | class TestImg2VecEncoder(unittest.TestCase):
    method test_encode (line 10) | def test_encode(self):
    method run_test_encoder_on_device (line 31) | def run_test_encoder_on_device(self, device):
    method test_encoder_on_cpu (line 37) | def test_encoder_on_cpu(self):
    method test_encoder_on_cuda (line 41) | def test_encoder_on_cuda(self):

FILE: tests/unit_tests/encoder/numeric/test_numeric.py
  function _pollute (line 10) | def _pollute(array):
  class TestNumericEncoder (line 18) | class TestNumericEncoder(unittest.TestCase):
    method test_encode_and_decode (line 19) | def test_encode_and_decode(self):
    method test_positive_domain (line 50) | def test_positive_domain(self):
    method test_log_overflow_and_none (line 61) | def test_log_overflow_and_none(self):
    method test_nan_encoding (line 74) | def test_nan_encoding(self):
    method test_weights (line 114) | def test_weights(self):

FILE: tests/unit_tests/encoder/text/test_pretrained.py
  function create_synthetic_data (line 15) | def create_synthetic_data(n, ptrain=0.7):
  class TestPretrainedLangEncoder (line 62) | class TestPretrainedLangEncoder(unittest.TestCase):
    method test_encode_and_decode (line 63) | def test_encode_and_decode(self):
    method test_embed_mode (line 93) | def test_embed_mode(self):
    method test_auto_embed_mode (line 120) | def test_auto_embed_mode(self):
    method run_test_encoder_on_device (line 154) | def run_test_encoder_on_device(self, device):
    method test_encoder_on_cpu (line 163) | def test_encoder_on_cpu(self):
    method test_encoder_on_cuda (line 167) | def test_encoder_on_cuda(self):

FILE: tests/unit_tests/encoder/text/test_short.py
  function generate_sentences (line 73) | def generate_sentences(min_, max_, vocab_size):
  class TestShortTextEncoder (line 78) | class TestShortTextEncoder(unittest.TestCase):
    method test_smallvocab_target_auto_mode (line 79) | def test_smallvocab_target_auto_mode(self):
    method test_non_smallvocab_target_auto_mode (line 102) | def test_non_smallvocab_target_auto_mode(self):
    method test_smallvocab_non_target_auto_mode (line 130) | def test_smallvocab_non_target_auto_mode(self):
    method test_non_smallvocab_non_target_auto_mode (line 149) | def test_non_smallvocab_non_target_auto_mode(self):
    method test_smallvocab_non_target_manual_mode (line 168) | def test_smallvocab_non_target_manual_mode(self):
    method test_non_smallvocab_non_target_manual_mode (line 189) | def test_non_smallvocab_non_target_manual_mode(self):
    method check_encoder_on_device (line 210) | def check_encoder_on_device(self, device):
    method test_encoder_on_cpu (line 216) | def test_encoder_on_cpu(self):
    method test_encoder_on_cuda (line 220) | def test_encoder_on_cuda(self):

FILE: tests/unit_tests/encoder/text/test_tfidf.py
  class TestTfidfEncoder (line 7) | class TestTfidfEncoder(unittest.TestCase):
    method test_encode (line 8) | def test_encode(self):

FILE: tests/unit_tests/encoder/text/test_vocab.py
  class TestVocabularyEncoder (line 6) | class TestVocabularyEncoder(unittest.TestCase):
    method test_encode_decode (line 7) | def test_encode_decode(self):

FILE: tests/unit_tests/encoder/time_series/test_timeseries_rnn.py
  class TestRnnEncoder (line 9) | class TestRnnEncoder(unittest.TestCase):
    method test_minmax_normalizer (line 11) | def test_minmax_normalizer(self):
    method test_cat_normalizer (line 22) | def test_cat_normalizer(self):
    method test_overfit (line 38) | def test_overfit(self):
    method check_encoder_on_device (line 88) | def check_encoder_on_device(self, device):
    method test_encoder_on_cpu (line 101) | def test_encoder_on_cpu(self):
    method test_encoder_on_cuda (line 105) | def test_encoder_on_cuda(self):

FILE: tests/unit_tests/encoder/time_series/test_transformer.py
  class TestTransformerEncoder (line 10) | class TestTransformerEncoder(unittest.TestCase):
    method test_get_chunk (line 11) | def test_get_chunk(self):
    method test_mask (line 37) | def test_mask(self):
    method test_overfit (line 54) | def test_overfit(self):

FILE: tests/unit_tests/helpers.py
  class TestTSDifferencer (line 9) | class TestTSDifferencer(unittest.TestCase):
    method test_numerical (line 10) | def test_numerical(self):

FILE: tests/unit_tests/mixer/test_lgbm.py
  class TestBasic (line 12) | class TestBasic(unittest.TestCase):
    method get_submodels (line 14) | def get_submodels(self):
    method test_0_regression (line 30) | def test_0_regression(self):

FILE: tests/unit_tests/mixer/test_nhits.py
  class TestBasic (line 11) | class TestBasic(unittest.TestCase):
    method get_submodels (line 12) | def get_submodels(self):
    method test_0_regression (line 28) | def test_0_regression(self):

FILE: tests/unit_tests/mixer/test_random_forest.py
  class TestBasic (line 12) | class TestBasic(unittest.TestCase):
    method get_submodels (line 14) | def get_submodels(self):
    method test_0_regression (line 30) | def test_0_regression(self):
    method test_1_binary (line 44) | def test_1_binary(self):

FILE: tests/unit_tests/mixer/test_tabtransformer.py
  class TestBasic (line 11) | class TestBasic(unittest.TestCase):
    method get_submodels (line 12) | def get_submodels(self):
    method test_0_regression (line 23) | def test_0_regression(self):
    method test_1_binary (line 37) | def test_1_binary(self):

FILE: tests/unit_tests/mixer/test_xgboost.py
  class TestBasic (line 10) | class TestBasic(unittest.TestCase):
    method get_submodels (line 12) | def get_submodels(self):
    method test_0_regression (line 28) | def test_0_regression(self):

FILE: tests/utils/data_generation.py
  function generate_timeseries (line 7) | def generate_timeseries(
  function rand_ascii_str (line 27) | def rand_ascii_str(length=30):
  function rand_int (line 32) | def rand_int():
  function rand_float (line 36) | def rand_float():
  function generate_value_cols (line 40) | def generate_value_cols(types, length, ts_period=48 * 3600):
  function generate_log_labels (line 79) | def generate_log_labels(columns, separator=','):
  function generate_timeseries_labels (line 95) | def generate_timeseries_labels(columns):
  function columns_to_file (line 121) | def columns_to_file(columns, filename, headers=None):

FILE: tests/utils/timing.py
  function train_and_check_time_aim (line 6) | def train_and_check_time_aim(predictor: PredictorInterface, train_df: pd...
Condensed preview — 248 files, each showing path, character count, and a content snippet (5,692K chars of structured content in total).
[
  {
    "path": ".deepsource.toml",
    "chars": 106,
    "preview": "version = 1\n\n[[analyzers]]\nname = \"python\"\nenabled = true\n\n  [analyzers.meta]\n  runtime_version = \"3.x.x\"\n"
  },
  {
    "path": ".flake8",
    "chars": 120,
    "preview": "[flake8]\nmax-line-length = 120\nignore = E275,E402,F821,W503,W504,C408,W391,E721\nexclude = .git,__pycache__,docs,docssrc\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "chars": 288,
    "preview": "---\nname: Bug report\nabout: Create a report to help us improve\nlabels: bug\n---\n\n## Your Environment\n* Python version:\n* "
  },
  {
    "path": ".github/ISSUE_TEMPLATE/question.md",
    "chars": 61,
    "preview": "---\nname: Question\nabout: Ask a question\nlabels: question\n---"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/suggestion.md",
    "chars": 104,
    "preview": "---\nname: Suggestion\nabout: Suggest a feature, improvement, doc change, etc.\nlabels: enhancement\n---\n\n\n\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE/pull_request_template.md",
    "chars": 42,
    "preview": "# Why is it needed ?\n\n# What does it do ?\n"
  },
  {
    "path": ".github/workflows/add_to_docs_project.yml",
    "chars": 481,
    "preview": "name: Add issue to docs project\n\non:\n  issues:\n    types:\n      - opened\n\njobs:\n  add-to-project:\n    name: Add issue to"
  },
  {
    "path": ".github/workflows/add_to_roadmap_project.yml",
    "chars": 423,
    "preview": "name: Add issue to roadmap project\non:\n  issues:\n    types:\n      - opened\njobs:\n  add-to-project:\n    name: Add issue t"
  },
  {
    "path": ".github/workflows/benchmark_check.yml",
    "chars": 695,
    "preview": "name: Benchmark Result Check Lightwood\n\n#on:\n#  pull_request:\n#    branches:\n#      - main\n\njobs:\n  check:\n    runs-on: "
  },
  {
    "path": ".github/workflows/cla.yml",
    "chars": 988,
    "preview": "name: \"Lightwood CLA Assistant\"\non:\n  issue_comment:\n    types: [created]\n  pull_request_target:\n    types: [opened,clos"
  },
  {
    "path": ".github/workflows/doc_build.yml",
    "chars": 2029,
    "preview": "name: Documentation Build Lightwood\n\non:\n  push:\n    branches:\n      - main\n      - separate_doc_branch\n      - jupyter_"
  },
  {
    "path": ".github/workflows/lightwood.yml",
    "chars": 2121,
    "preview": "name: Integration and Unit Tests Lightwood\n\non:\n  push:\n  pull_request:\n    branches:\n      - main\n  release:\n    types:"
  },
  {
    "path": ".gitignore",
    "chars": 734,
    "preview": "*.pth\n*.vec\n*.pkl\n*.dill\n*.test.*\n.cache*\n*.jar\nmindsdb.egg-info\n.pypirc\n\n# Byte-compiled / optimized / DLL files\n*__pyc"
  },
  {
    "path": ".nojekyll",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 3361,
    "preview": "\n# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make particip"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 4919,
    "preview": "\n# Contribute to Lightwood\n\nWe love to receive contributions from the community and hear your opinions! We want to make "
  },
  {
    "path": "LICENSE",
    "chars": 35104,
    "preview": "GNU GENERAL PUBLIC LICENSE\n                       Version 3, 29 June 2007\n\n Copyright (C) 2007 Free Software Foundation,"
  },
  {
    "path": "README.md",
    "chars": 10955,
    "preview": "# Lightwood\n\n<!--- badges here? --->\n\nLightwood is an AutoML framework that enables you to generate and customize machin"
  },
  {
    "path": "__init__.py",
    "chars": 19,
    "preview": "name = \"lightwood\"\n"
  },
  {
    "path": "assets/contributions-agreement/signatures/cla.json",
    "chars": 31,
    "preview": "{\n   \"signedContributors\": []\n}"
  },
  {
    "path": "docssrc/Makefile",
    "chars": 822,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
  },
  {
    "path": "docssrc/README.md",
    "chars": 753,
    "preview": "## Compiling the docs\n- Make sure you are in `docssrc`, then follow the instructions under `run` in our [documentation b"
  },
  {
    "path": "docssrc/source/_static/custom.css",
    "chars": 5274,
    "preview": "/* override css for readable.css */\n\n/*  styles/fonts to match http://mdanalysis.org (see public/css) */\n/* MindsDB --sh"
  },
  {
    "path": "docssrc/source/analysis.rst",
    "chars": 185,
    "preview": ":mod:`Analysis`\n==========================\n\nAnalyse mixer ensembles to extract static insights and train predict-time mo"
  },
  {
    "path": "docssrc/source/api/dtype.rst",
    "chars": 666,
    "preview": "Data Types (dtypes)\n--------------------\nLightwood supports several data types used in standard machine learning pipelin"
  },
  {
    "path": "docssrc/source/api/encode.rst",
    "chars": 78,
    "preview": "Encode your data\n--------------------\n\n.. automodule:: api.encode\n   :members:"
  },
  {
    "path": "docssrc/source/api/high_level.rst",
    "chars": 80,
    "preview": "JSON-AI Config\n--------------------\n\n.. automodule:: api.high_level\n   :members:"
  },
  {
    "path": "docssrc/source/api/json_ai.rst",
    "chars": 77,
    "preview": "JSON-AI Config\n--------------------\n\n.. automodule:: api.json_ai\n   :members:"
  },
  {
    "path": "docssrc/source/api/predictor.rst",
    "chars": 183,
    "preview": "Predictor Interface\n--------------------\nThe ``PredictorInterface`` creates the skeletal structure around basic function"
  },
  {
    "path": "docssrc/source/api/types.rst",
    "chars": 251,
    "preview": "Lightwood API Types\n--------------------\nLightwood consists of several high level abstractions to enable the data scienc"
  },
  {
    "path": "docssrc/source/api.rst",
    "chars": 249,
    "preview": ":mod:`API`\n==========================\n\nThe API module is how Lightwood interfaces with the user.\n\n.. toctree::\n   :maxde"
  },
  {
    "path": "docssrc/source/conf.py",
    "chars": 3438,
    "preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
  },
  {
    "path": "docssrc/source/data.rst",
    "chars": 209,
    "preview": ":mod:`Data`\n==========================\n\nThe focus of these modules is on storing, transforming, cleaning, splitting, mer"
  },
  {
    "path": "docssrc/source/encoder.rst",
    "chars": 163,
    "preview": ":mod:`Encoders`\n==========================\n\nUsed for encoding data into PyTorch tensors and decoding it from pytorch ten"
  },
  {
    "path": "docssrc/source/ensemble.rst",
    "chars": 140,
    "preview": ":mod:`Ensemble`\n==========================\n\nEnsemble mixers together in order to generate predictions\n\n.. automodule:: e"
  },
  {
    "path": "docssrc/source/helpers.rst",
    "chars": 105,
    "preview": ":mod:`Helpers`\n==========================\n\nVarious helper functions\n\n.. automodule:: helpers\n   :members:"
  },
  {
    "path": "docssrc/source/index.rst",
    "chars": 10058,
    "preview": ".. -*- coding: utf-8 -*-\n.. lightwood_docs documentation master file, created by\n   sphinx-quickstart on Tue Sep  7 13:0"
  },
  {
    "path": "docssrc/source/lightwood_philosophy.rst",
    "chars": 5632,
    "preview": ":mod:`Lightwood Philosophy`\n================================\n\n\nIntroduction\n------------\n\nLightwood works by generating "
  },
  {
    "path": "docssrc/source/mixer.rst",
    "chars": 163,
    "preview": ":mod:`Mixers`\n==========================\n\nMixers learn to map encoded representation, they are the core of lightwood's a"
  },
  {
    "path": "docssrc/source/tutorials/README.md",
    "chars": 1857,
    "preview": "## How to make a tutorial notebook?\n\nWe use some of our tutorial notebooks as unit-tests to ensure that our pipeline is "
  },
  {
    "path": "docssrc/source/tutorials/custom_cleaner/custom_cleaner.ipynb",
    "chars": 23037,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"regulated-manufacturer\",\n   \"metadata\": {},\n   \"source\": [\n    \""
  },
  {
    "path": "docssrc/source/tutorials/custom_encoder_rulebased/custom_encoder_rulebased.ipynb",
    "chars": 21500,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"smooth-philip\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Custo"
  },
  {
    "path": "docssrc/source/tutorials/custom_explainer/custom_explainer.ipynb",
    "chars": 17821,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Tutorial - Implementing a custom "
  },
  {
    "path": "docssrc/source/tutorials/custom_mixer/custom_mixer.ipynb",
    "chars": 25784,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Tutorial - Implementing a custom "
  },
  {
    "path": "docssrc/source/tutorials/custom_splitter/custom_splitter.ipynb",
    "chars": 84118,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"israeli-spyware\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Buil"
  },
  {
    "path": "docssrc/source/tutorials/tutorial_data_analysis/tutorial_data_analysis.ipynb",
    "chars": 352811,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Tutorial - Introduction to Lightw"
  },
  {
    "path": "docssrc/source/tutorials/tutorial_time_series/tutorial_time_series.ipynb",
    "chars": 81441,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Tutorial - Time series forecastin"
  },
  {
    "path": "docssrc/source/tutorials/tutorial_update_models/tutorial_update_models.ipynb",
    "chars": 8800,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Introduction\\n\",\n    \"\\n\",\n    \"I"
  },
  {
    "path": "docssrc/source/tutorials.rst",
    "chars": 862,
    "preview": ":mod:`Tutorials`\n==========================\n.. toctree::\n   :maxdepth: 1\n   :caption: Table of Contents:\n\n| `Construct a"
  },
  {
    "path": "lightwood/__about__.py",
    "chars": 399,
    "preview": "__title__ = 'lightwood'\n__package_name__ = 'lightwood'\n__version__ = '25.12.1.0'\n__description__ = \"Lightwood is a toolk"
  },
  {
    "path": "lightwood/__init__.py",
    "chars": 376,
    "preview": "import os\nimport logging\nlogging.getLogger('matplotlib').setLevel(level=logging.WARNING)\nfrom lightwood.api import __all"
  },
  {
    "path": "lightwood/analysis/__init__.py",
    "chars": 883,
    "preview": "# Phases\nfrom lightwood.analysis.analyze import model_analyzer\nfrom lightwood.analysis.explain import explain\n\n# Base bl"
  },
  {
    "path": "lightwood/analysis/analyze.py",
    "chars": 5255,
    "preview": "from typing import Dict, List, Tuple, Optional\n\nimport numpy as np\nfrom dataprep_ml import StatisticalAnalysis\n\nfrom lig"
  },
  {
    "path": "lightwood/analysis/base.py",
    "chars": 2456,
    "preview": "from typing import Tuple, Dict, Optional\n\nimport pandas as pd\nfrom lightwood.helpers.log import log\n\n\nclass BaseAnalysis"
  },
  {
    "path": "lightwood/analysis/explain.py",
    "chars": 3463,
    "preview": "from typing import Optional, List, Dict\nimport torch\nimport pandas as pd\n\nfrom dataprep_ml import StatisticalAnalysis\n\nf"
  },
  {
    "path": "lightwood/analysis/helpers/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/analysis/helpers/acc_stats.py",
    "chars": 7250,
    "preview": "import random\nfrom types import SimpleNamespace\nfrom typing import Dict, Optional\n\nimport numpy as np\nfrom sklearn.metri"
  },
  {
    "path": "lightwood/analysis/helpers/conf_stats.py",
    "chars": 4305,
    "preview": "from copy import deepcopy\nfrom typing import Dict\nfrom types import SimpleNamespace\n\nfrom sklearn.preprocessing import O"
  },
  {
    "path": "lightwood/analysis/helpers/feature_importance.py",
    "chars": 5163,
    "preview": "from copy import deepcopy\nfrom types import SimpleNamespace\nfrom typing import Dict\n\nimport numpy as np\nfrom sklearn.uti"
  },
  {
    "path": "lightwood/analysis/helpers/pyod.py",
    "chars": 3949,
    "preview": "from copy import deepcopy\nfrom types import SimpleNamespace\nfrom typing import Dict, Optional, Tuple\n\nimport pandas as p"
  },
  {
    "path": "lightwood/analysis/helpers/shap.py",
    "chars": 4156,
    "preview": "import warnings\nfrom types import SimpleNamespace\nfrom typing import Dict, Optional, Tuple\n\nimport numpy as np\nimport pa"
  },
  {
    "path": "lightwood/analysis/nc/LICENSE",
    "chars": 1082,
    "preview": "The MIT License (MIT)\n\nCopyright (c) 2015 Henrik Linusson\n\nPermission is hereby granted, free of charge, to any person o"
  },
  {
    "path": "lightwood/analysis/nc/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/analysis/nc/base.py",
    "chars": 5255,
    "preview": "# Original author: Henrik Linusson (github.com/donlnz)\nimport abc\nfrom typing import Dict\nimport numpy as np\nfrom sklear"
  },
  {
    "path": "lightwood/analysis/nc/calibrate.py",
    "chars": 29302,
    "preview": "import inspect\nfrom copy import deepcopy\nfrom typing import Dict, Tuple, Optional\nfrom types import SimpleNamespace\n\nimp"
  },
  {
    "path": "lightwood/analysis/nc/icp.py",
    "chars": 16835,
    "preview": "\"\"\"\nInductive conformal predictors.\n\"\"\"\n# Original author: Henrik Linusson (github.com/donlnz)\nfrom collections import d"
  },
  {
    "path": "lightwood/analysis/nc/metrics.py",
    "chars": 5728,
    "preview": "# Original author: Henrik Linusson (github.com/donlnz)\n\nimport numpy as np\n\n\n# -----------------------------------------"
  },
  {
    "path": "lightwood/analysis/nc/nc.py",
    "chars": 22330,
    "preview": "\"\"\"\nNonconformity functions.\n\"\"\"\n\n# Original author: Henrik Linusson (github.com/donlnz)\n\nimport abc\nimport numpy as np\n"
  },
  {
    "path": "lightwood/analysis/nc/norm.py",
    "chars": 5722,
    "preview": "from typing import Union\n\nimport numpy as np\nimport pandas as pd\nfrom scipy.stats import entropy\nfrom sklearn.linear_mod"
  },
  {
    "path": "lightwood/analysis/nc/util.py",
    "chars": 10049,
    "preview": "from typing import Union, Optional\n\nimport torch\nimport numpy as np\nimport pandas as pd\nfrom torch.nn.functional import "
  },
  {
    "path": "lightwood/analysis/nn_conf/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/analysis/nn_conf/temp_scale.py",
    "chars": 4369,
    "preview": "from copy import deepcopy\nfrom typing import Dict, Tuple\nfrom types import SimpleNamespace\n\nimport torch\nimport pandas a"
  },
  {
    "path": "lightwood/analysis/nn_conf/temp_scale_license",
    "chars": 1068,
    "preview": "MIT License\n\nCopyright (c) 2017 Geoff Pleiss\n\nPermission is hereby granted, free of charge, to any person obtaining a co"
  },
  {
    "path": "lightwood/api/__init__.py",
    "chars": 753,
    "preview": "from lightwood.api.types import (\n    JsonAI,\n    ProblemDefinition,\n    TimeseriesSettings,\n    ModelAnalysis,\n    Pred"
  },
  {
    "path": "lightwood/api/high_level.py",
    "chars": 6699,
    "preview": "import os\nfrom typing import Union\nimport dill\nimport pandas as pd\nfrom lightwood.api.types import JsonAI, ProblemDefini"
  },
  {
    "path": "lightwood/api/json_ai.py",
    "chars": 30130,
    "preview": "# TODO: add_implicit_values unit test ensures NO changes for a fully specified file.\nimport inspect\n\nfrom type_infer.dty"
  },
  {
    "path": "lightwood/api/predictor.py",
    "chars": 8904,
    "preview": "import dill\nfrom typing import Dict, Optional\n\nimport pandas as pd\nfrom lightwood.api.types import ModelAnalysis\n\n\n# Int"
  },
  {
    "path": "lightwood/api/types.py",
    "chars": 25315,
    "preview": "# TODO: type hint the returns\n\nfrom typing import Dict, List, Optional, Union\nimport sys\n\nfrom type_infer.dtype import d"
  },
  {
    "path": "lightwood/data/__init__.py",
    "chars": 296,
    "preview": "from lightwood.data.timeseries_transform import transform_timeseries\nfrom lightwood.data.timeseries_analyzer import time"
  },
  {
    "path": "lightwood/data/encoded_ds.py",
    "chars": 10395,
    "preview": "import inspect\nfrom typing import List, Tuple, Dict\nimport torch\nimport numpy as np\nimport pandas as pd\nfrom torch.utils"
  },
  {
    "path": "lightwood/data/timeseries_analyzer.py",
    "chars": 5142,
    "preview": "from typing import Dict, Tuple, List\n\nimport numpy as np\nimport pandas as pd\nfrom type_infer.dtype import dtype\n\nfrom li"
  },
  {
    "path": "lightwood/data/timeseries_transform.py",
    "chars": 14743,
    "preview": "from typing import Dict, Optional\nfrom functools import partial\nimport multiprocessing as mp\n\nimport numpy as np\nimport "
  },
  {
    "path": "lightwood/encoder/__init__.py",
    "chars": 1770,
    "preview": "# Encoders which should always work\nfrom lightwood.encoder.base import BaseEncoder\nfrom lightwood.encoder.datetime.datet"
  },
  {
    "path": "lightwood/encoder/array/__init__.py",
    "chars": 83,
    "preview": "from lightwood.encoder.array.array import ArrayEncoder\n\n__all__ = ['ArrayEncoder']\n"
  },
  {
    "path": "lightwood/encoder/array/array.py",
    "chars": 5366,
    "preview": "import torch\nimport pandas as pd\nimport numpy as np\nfrom lightwood.encoder.base import BaseEncoder\nfrom type_infer.dtype"
  },
  {
    "path": "lightwood/encoder/array/ts_cat_array.py",
    "chars": 4768,
    "preview": "from typing import List, Dict, Iterable, Optional\n\nimport torch\nimport torch.nn.functional as F\n\nfrom lightwood.encoder "
  },
  {
    "path": "lightwood/encoder/array/ts_num_array.py",
    "chars": 4534,
    "preview": "from typing import List, Dict, Iterable, Optional\n\nimport torch\n\nfrom lightwood.encoder import BaseEncoder\nfrom lightwoo"
  },
  {
    "path": "lightwood/encoder/audio/__init__.py",
    "chars": 226,
    "preview": "# This encoder is optional since it's underlying dependency (librosa) needs system dependencies\ntry:\n    from lightwood."
  },
  {
    "path": "lightwood/encoder/audio/mfcc.py",
    "chars": 3575,
    "preview": "import librosa\nimport torch\nimport warnings\nfrom lightwood.encoder.base import BaseEncoder\nfrom lightwood.helpers.io imp"
  },
  {
    "path": "lightwood/encoder/base.py",
    "chars": 3960,
    "preview": "from typing import List, Iterable\nimport torch\n\n\nclass BaseEncoder:\n    \"\"\"\n    Base class for all encoders.\n    \n    An"
  },
  {
    "path": "lightwood/encoder/categorical/__init__.py",
    "chars": 377,
    "preview": "from lightwood.encoder.categorical.onehot import OneHotEncoder\nfrom lightwood.encoder.categorical.simple_label import Si"
  },
  {
    "path": "lightwood/encoder/categorical/autoencoder.py",
    "chars": 10217,
    "preview": "import random\nimport numpy as np\nimport torch\nfrom torch.utils.data import DataLoader\nfrom lightwood.mixer.helpers.range"
  },
  {
    "path": "lightwood/encoder/categorical/binary.py",
    "chars": 8535,
    "preview": "from copy import deepcopy as dc\nfrom typing import Dict, List, Iterable, Tuple\n\nimport torch\nimport numpy as np\nfrom sci"
  },
  {
    "path": "lightwood/encoder/categorical/gym.py",
    "chars": 5348,
    "preview": "import copy\nimport time\nimport torch\n\nimport numpy as np\nfrom lightwood.helpers.torch import LightwoodAutocast\nfrom ligh"
  },
  {
    "path": "lightwood/encoder/categorical/multihot.py",
    "chars": 1439,
    "preview": "import torch\nimport numpy as np\nfrom lightwood.encoder import BaseEncoder\nfrom sklearn.preprocessing import MultiLabelBi"
  },
  {
    "path": "lightwood/encoder/categorical/onehot.py",
    "chars": 7754,
    "preview": "from copy import deepcopy\nfrom typing import Dict, List, Iterable, Tuple\n\nimport torch\nimport numpy as np\nfrom scipy.spe"
  },
  {
    "path": "lightwood/encoder/categorical/simple_label.py",
    "chars": 2540,
    "preview": "from typing import List, Union\nfrom collections import defaultdict\nimport pandas as pd\nimport numpy as np\nimport torch\n\n"
  },
  {
    "path": "lightwood/encoder/datetime/__init__.py",
    "chars": 213,
    "preview": "from lightwood.encoder.datetime.datetime import DatetimeEncoder\nfrom lightwood.encoder.datetime.datetime_sin_normalizer "
  },
  {
    "path": "lightwood/encoder/datetime/datetime.py",
    "chars": 3589,
    "preview": "from typing import Union\n\nimport torch\nimport numpy as np\nimport pandas as pd\n\nfrom lightwood.encoder.base import BaseEn"
  },
  {
    "path": "lightwood/encoder/datetime/datetime_sin_normalizer.py",
    "chars": 4189,
    "preview": "import datetime\nimport calendar\nimport numpy as np\nimport pandas as pd  # @TODO: remove?\nimport torch\nfrom lightwood.enc"
  },
  {
    "path": "lightwood/encoder/helpers.py",
    "chars": 2538,
    "preview": "import torch\nimport numpy as np\nfrom sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder\n\n\nclass Mi"
  },
  {
    "path": "lightwood/encoder/identity/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/encoder/identity/identity.py",
    "chars": 1944,
    "preview": "import numpy as np\nfrom typing import List, Iterable, Union\nimport torch\nfrom lightwood.encoder.base import BaseEncoder\n"
  },
  {
    "path": "lightwood/encoder/image/__init__.py",
    "chars": 144,
    "preview": "try:\n    from lightwood.encoder.image.img_2_vec import Img2VecEncoder\nexcept Exception:\n    Img2VecEncoder = None\n\n__all"
  },
  {
    "path": "lightwood/encoder/image/helpers/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/encoder/image/helpers/img_to_vec.py",
    "chars": 2170,
    "preview": "import torch\r\nimport torch.nn as nn\r\nfrom lightwood.helpers.device import get_device_from_name\r\nfrom lightwood.helpers.t"
  },
  {
    "path": "lightwood/encoder/image/img_2_vec.py",
    "chars": 4409,
    "preview": "from typing import List, Tuple, Iterable\nimport torch\nfrom lightwood.encoder.image.helpers.img_to_vec import Img2Vec\nfro"
  },
  {
    "path": "lightwood/encoder/numeric/__init__.py",
    "chars": 178,
    "preview": "from lightwood.encoder.numeric.numeric import NumericEncoder\nfrom lightwood.encoder.numeric.ts_numeric import TsNumericE"
  },
  {
    "path": "lightwood/encoder/numeric/numeric.py",
    "chars": 7485,
    "preview": "import math\nfrom typing import Union, Dict\nfrom copy import deepcopy as dc\n\nimport torch\nimport numpy as np\nimport panda"
  },
  {
    "path": "lightwood/encoder/numeric/ts_numeric.py",
    "chars": 4656,
    "preview": "from typing import Union, List, Dict\n\nimport torch\nimport numpy as np\nimport pandas as pd\n\nfrom lightwood.encoder.numeri"
  },
  {
    "path": "lightwood/encoder/text/__init__.py",
    "chars": 334,
    "preview": "from lightwood.encoder.text.pretrained import PretrainedLangEncoder\nfrom lightwood.encoder.text.tfidf import TfidfEncode"
  },
  {
    "path": "lightwood/encoder/text/helpers/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/encoder/text/helpers/pretrained_helpers.py",
    "chars": 652,
    "preview": "\"\"\"\n2021.03.05\n\nBasic helper functions for PretrainedLangEncoder\n\"\"\"\nimport torch\n\n\nclass TextEmbed(torch.utils.data.Dat"
  },
  {
    "path": "lightwood/encoder/text/pretrained.py",
    "chars": 15930,
    "preview": "import os\nimport time\nfrom typing import Iterable\nfrom collections import deque\n\nimport numpy as np\nimport torch\nfrom to"
  },
  {
    "path": "lightwood/encoder/text/short.py",
    "chars": 3987,
    "preview": "from typing import List\nimport torch\nfrom lightwood.encoder import BaseEncoder\nfrom lightwood.encoder.categorical import"
  },
  {
    "path": "lightwood/encoder/text/tfidf.py",
    "chars": 942,
    "preview": "import torch\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nimport numpy as np\n\nfrom lightwood.encoder.base"
  },
  {
    "path": "lightwood/encoder/text/vocab.py",
    "chars": 1432,
    "preview": "import os\nimport torch\nfrom transformers import DistilBertTokenizer\nfrom lightwood.encoder.base import BaseEncoder\n\n\ncla"
  },
  {
    "path": "lightwood/encoder/time_series/__init__.py",
    "chars": 96,
    "preview": "from lightwood.encoder.time_series.ts import TimeSeriesEncoder\n\n__all__ = ['TimeSeriesEncoder']\n"
  },
  {
    "path": "lightwood/encoder/time_series/helpers/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/encoder/time_series/helpers/common.py",
    "chars": 1743,
    "preview": "\nimport pandas as pd\n\nfrom lightwood.api.types import TimeseriesSettings\nfrom type_infer.dtype import dtype\nfrom lightwo"
  },
  {
    "path": "lightwood/encoder/time_series/helpers/rnn_helpers.py",
    "chars": 3890,
    "preview": "import torch\nimport torch.nn as nn\nimport numpy as np\nfrom lightwood.helpers.torch import LightwoodAutocast\n\n\nclass Deco"
  },
  {
    "path": "lightwood/encoder/time_series/helpers/transformer_helpers.py",
    "chars": 5184,
    "preview": "import math\nimport torch\nimport torch.nn as nn\nfrom lightwood.helpers.torch import LightwoodAutocast\n\n\ndef len_to_mask(l"
  },
  {
    "path": "lightwood/encoder/time_series/rnn.py",
    "chars": 23548,
    "preview": "import time\nfrom math import gcd\nfrom typing import List\nfrom copy import deepcopy\n\nimport numpy as np\nimport pandas as "
  },
  {
    "path": "lightwood/encoder/time_series/ts.py",
    "chars": 2146,
    "preview": "from typing import List, Iterable\n\nimport torch\n\nfrom type_infer.dtype import dtype\nfrom lightwood.encoder.array import "
  },
  {
    "path": "lightwood/ensemble/__init__.py",
    "chars": 699,
    "preview": "from lightwood.ensemble.base import BaseEnsemble\nfrom lightwood.ensemble.identity import IdentityEnsemble\nfrom lightwood"
  },
  {
    "path": "lightwood/ensemble/base.py",
    "chars": 1625,
    "preview": "from typing import List, Optional\n\nimport pandas as pd\n\nfrom lightwood.mixer.base import BaseMixer\nfrom lightwood.data.e"
  },
  {
    "path": "lightwood/ensemble/best_of.py",
    "chars": 3119,
    "preview": "from typing import List, Optional\n\nimport numpy as np\nimport pandas as pd\nfrom mindsdb_evaluator import evaluate_accurac"
  },
  {
    "path": "lightwood/ensemble/embed.py",
    "chars": 995,
    "preview": "from typing import List\nimport pandas as pd\n\nfrom lightwood.mixer.base import BaseMixer\nfrom lightwood.ensemble.base imp"
  },
  {
    "path": "lightwood/ensemble/identity.py",
    "chars": 1631,
    "preview": "from typing import List\nimport pandas as pd\n\nfrom lightwood.mixer.base import BaseMixer\nfrom lightwood.ensemble.base imp"
  },
  {
    "path": "lightwood/ensemble/mean_ensemble.py",
    "chars": 1285,
    "preview": "from typing import List, Optional\n\nimport pandas as pd\n\nfrom lightwood.mixer.base import BaseMixer\nfrom lightwood.ensemb"
  },
  {
    "path": "lightwood/ensemble/mode_ensemble.py",
    "chars": 4042,
    "preview": "from typing import List, Optional, Dict\n\nimport pandas as pd\nimport numpy as np\n\nfrom lightwood.mixer.base import BaseMi"
  },
  {
    "path": "lightwood/ensemble/stacked_ensemble.py",
    "chars": 3246,
    "preview": "from typing import List, Optional\n\nimport numpy as np\nimport pandas as pd\n\nimport torch\nfrom torch import nn\nfrom torch."
  },
  {
    "path": "lightwood/ensemble/ts_stacked_ensemble.py",
    "chars": 2948,
    "preview": "from copy import deepcopy\nfrom typing import List, Optional\n\nimport torch\nfrom torch import nn\nfrom torch.optim import S"
  },
  {
    "path": "lightwood/ensemble/weighted_mean_ensemble.py",
    "chars": 3212,
    "preview": "from typing import List, Optional\n\nimport numpy as np\nimport pandas as pd\n\nfrom lightwood.helpers.log import log\nfrom ty"
  },
  {
    "path": "lightwood/helpers/__init__.py",
    "chars": 945,
    "preview": "from lightwood.helpers.device import is_cuda_compatible, get_devices\nfrom lightwood.helpers.general import is_none\nfrom "
  },
  {
    "path": "lightwood/helpers/codegen.py",
    "chars": 23048,
    "preview": "import os\nimport sys\nimport time\nimport string\nimport random\nimport tempfile\nimport importlib\n\nfrom copy import deepcopy"
  },
  {
    "path": "lightwood/helpers/constants.py",
    "chars": 2047,
    "preview": "\"\"\"\nReserved constants for Lightwood.\n\"\"\"\n\n_UNCOMMON_WORD = '__mdb_unknown_cat'\n_UNCOMMON_TOKEN = 0\n\n# For custom module"
  },
  {
    "path": "lightwood/helpers/device.py",
    "chars": 3655,
    "preview": "import torch\nimport os\nfrom random import randint\nfrom torch.cuda import device_count, get_device_capability\n\n\ndef is_cu"
  },
  {
    "path": "lightwood/helpers/general.py",
    "chars": 1538,
    "preview": "from typing import Iterable\n\nimport numpy as np\nfrom type_infer.helpers import is_nan_numeric\n\n\ndef is_none(value):\n    "
  },
  {
    "path": "lightwood/helpers/io.py",
    "chars": 580,
    "preview": "import requests\nfrom lightwood.helpers.log import log\nimport os\n\n\ndef read_from_path_or_url(path: str, load_from_path):\n"
  },
  {
    "path": "lightwood/helpers/log.py",
    "chars": 1403,
    "preview": "import os\nimport logging\nimport colorlog\nfrom time import time\nfrom datetime import datetime\nfrom functools import wraps"
  },
  {
    "path": "lightwood/helpers/numeric.py",
    "chars": 201,
    "preview": "from typing import Iterable\nfrom type_infer.helpers import is_nan_numeric\n\n\ndef filter_nan_and_none(series: Iterable) ->"
  },
  {
    "path": "lightwood/helpers/parallelism.py",
    "chars": 2182,
    "preview": "import os\nfrom typing import Dict\nimport psutil\nimport multiprocessing as mp\nfrom lightwood.helpers.log import log\n\nMAX_"
  },
  {
    "path": "lightwood/helpers/seed.py",
    "chars": 253,
    "preview": "import random\nimport torch\nimport numpy as np\n\n\ndef seed(seed_nr: int) -> None:\n    torch.manual_seed(seed_nr)\n    torch"
  },
  {
    "path": "lightwood/helpers/templating.py",
    "chars": 4767,
    "preview": "import json\nfrom typing import Callable\nfrom collections import deque\n\nimport inspect\nimport numpy as np\n\nfrom type_infe"
  },
  {
    "path": "lightwood/helpers/text.py",
    "chars": 1086,
    "preview": "\"\"\"\n*******************************************************\n * Copyright (C) 2017 MindsDB Inc. <copyright@mindsdb.com>\n "
  },
  {
    "path": "lightwood/helpers/torch.py",
    "chars": 4312,
    "preview": "import functools\nimport torch\nfrom torch.nn.functional import pad\nfrom lightwood.helpers.device import get_devices\n\n\ndef"
  },
  {
    "path": "lightwood/helpers/ts.py",
    "chars": 9733,
    "preview": "from typing import Tuple, Dict\nfrom datetime import datetime\n\nimport numpy as np\nimport pandas as pd\n\n\ndef get_ts_groups"
  },
  {
    "path": "lightwood/mixer/__init__.py",
    "chars": 1378,
    "preview": "from lightwood.mixer.base import BaseMixer\nfrom lightwood.mixer.unit import Unit\nfrom lightwood.mixer.neural import Neur"
  },
  {
    "path": "lightwood/mixer/arima.py",
    "chars": 5159,
    "preview": "from typing import Dict, Optional\n\nfrom lightwood.mixer.sktime import SkTime\n\n\nclass ARIMAMixer(SkTime):\n    def __init_"
  },
  {
    "path": "lightwood/mixer/base.py",
    "chars": 3952,
    "preview": "from typing import Optional\nimport pandas as pd\n\nfrom lightwood.data.encoded_ds import EncodedDs\nfrom lightwood.api.type"
  },
  {
    "path": "lightwood/mixer/ets.py",
    "chars": 3147,
    "preview": "from typing import Dict, Optional\n\nfrom lightwood.mixer.sktime import SkTime\n\n\nclass ETSMixer(SkTime):\n    def __init__("
  },
  {
    "path": "lightwood/mixer/helpers/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lightwood/mixer/helpers/ar_net.py",
    "chars": 2320,
    "preview": "import torch\nfrom torch import nn\nfrom lightwood.mixer.helpers.default_net import DefaultNet\nfrom lightwood.helpers.torc"
  },
  {
    "path": "lightwood/mixer/helpers/default_net.py",
    "chars": 2902,
    "preview": "import math\nimport torch\nfrom lightwood.helpers.torch import LightwoodAutocast\nfrom lightwood.helpers.device import get_"
  },
  {
    "path": "lightwood/mixer/helpers/qclassic_net.py",
    "chars": 4278,
    "preview": "import torch\nimport qiskit\nimport numpy as np\n\nfrom lightwood.mixer.helpers.default_net import DefaultNet\nfrom lightwood"
  },
  {
    "path": "lightwood/mixer/helpers/ranger.py",
    "chars": 5526,
    "preview": "import math\nimport torch\nfrom torch.optim.optimizer import Optimizer\n\n\nclass Ranger(Optimizer):\n    def __init__(\n      "
  },
  {
    "path": "lightwood/mixer/helpers/residual_net.py",
    "chars": 2699,
    "preview": "from typing import List\nimport torch\nfrom torch import nn\nfrom lightwood.helpers.torch import LightwoodAutocast\nfrom lig"
  },
  {
    "path": "lightwood/mixer/helpers/transform_corss_entropy_loss.py",
    "chars": 589,
    "preview": "import torch\nfrom lightwood.helpers.torch import LightwoodAutocast\n\n\n# Basically cross entropy loss that does the one ho"
  },
  {
    "path": "lightwood/mixer/helpers/ts.py",
    "chars": 1955,
    "preview": "\"\"\"\nTime series utility methods for usage within mixers.\n\"\"\"\nfrom typing import Dict\nfrom copy import deepcopy\n\nimport p"
  },
  {
    "path": "lightwood/mixer/lightgbm.py",
    "chars": 16048,
    "preview": "import time\nimport inspect\nfrom typing import Dict, List, Set, Optional\nimport torch\nimport optuna\nimport lightgbm\nimpor"
  },
  {
    "path": "lightwood/mixer/lightgbm_array.py",
    "chars": 3924,
    "preview": "from copy import deepcopy\nfrom typing import Dict, List, Union, Optional\n\nimport numpy as np\nimport pandas as pd\n\nfrom l"
  },
  {
    "path": "lightwood/mixer/neural.py",
    "chars": 15938,
    "preview": "import time\nfrom copy import deepcopy\nfrom collections import deque\nfrom typing import Dict, List, Optional\n\nimport torc"
  },
  {
    "path": "lightwood/mixer/neural_ts.py",
    "chars": 6935,
    "preview": "import time\nfrom copy import deepcopy\nfrom typing import Dict, Optional, List\n\nimport numpy as np\nimport pandas as pd\n\ni"
  },
  {
    "path": "lightwood/mixer/nhits.py",
    "chars": 11999,
    "preview": "from typing import Dict, Union, Optional\nfrom copy import deepcopy\n\nimport numpy as np\nimport pandas as pd\nfrom neuralfo"
  },
  {
    "path": "lightwood/mixer/prophet.py",
    "chars": 3450,
    "preview": "from typing import Dict, Optional, Union\n\nimport pandas as pd\n\nfrom lightwood.mixer.sktime import SkTime\n\n\nclass Prophet"
  },
  {
    "path": "lightwood/mixer/qclassic.py",
    "chars": 1023,
    "preview": "from typing import Dict, List\n\nfrom lightwood.encoder.base import BaseEncoder\nfrom lightwood.mixer.neural import Neural\n"
  },
  {
    "path": "lightwood/mixer/random_forest.py",
    "chars": 9092,
    "preview": "import time\nimport math\nimport torch\nimport numpy as np\nimport pandas as pd\nimport optuna\nfrom typing import Dict, Union"
  },
  {
    "path": "lightwood/mixer/regression.py",
    "chars": 4377,
    "preview": "from typing import Optional\nimport torch\nimport pandas as pd\nfrom scipy.special import softmax\nfrom sklearn.linear_model"
  },
  {
    "path": "lightwood/mixer/sktime.py",
    "chars": 17465,
    "preview": "import importlib\nfrom copy import deepcopy\nfrom datetime import datetime\nfrom typing import Dict, Union, Optional\n\nimpor"
  },
  {
    "path": "lightwood/mixer/tabtransformer.py",
    "chars": 2859,
    "preview": "from typing import Dict, Optional\n\nimport torch\nfrom tab_transformer_pytorch import TabTransformer\n\nfrom lightwood.helpe"
  },
  {
    "path": "lightwood/mixer/unit.py",
    "chars": 2305,
    "preview": "from typing import List, Optional\n\nimport torch\nimport pandas as pd\n\nfrom lightwood.helpers.log import log\nfrom lightwoo"
  },
  {
    "path": "lightwood/mixer/xgboost.py",
    "chars": 13389,
    "preview": "from typing import Dict, List, Optional, Union\nimport time\n\nimport torch\nimport optuna\nimport xgboost as xgb\nimport nump"
  },
  {
    "path": "lightwood/mixer/xgboost_array.py",
    "chars": 3933,
    "preview": "from typing import Dict, List, Union, Optional\nfrom copy import deepcopy\n\nimport numpy as np\nimport pandas as pd\n\nfrom l"
  },
  {
    "path": "pyproject.toml",
    "chars": 1879,
    "preview": "[build-system]\nrequires = [\"poetry-core\"]\nbuild-backend = \"poetry.core.masonry.api\"\n\n[tool.poetry]\nname = \"lightwood\"\nve"
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/data/airline_sentiment.csv",
    "chars": 3416340,
    "preview": "tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentime"
  },
  {
    "path": "tests/data/arrivals.csv",
    "chars": 8938,
    "preview": "T,Country,Traffic\n1981-1,Japan,14763\n1981-4,Japan,9321\n1981-7,Japan,10166\n1981-10,Japan,19509\n1982-1,Japan,17117\n1982-4,"
  },
  {
    "path": "tests/data/concrete_strength.csv",
    "chars": 52515,
    "preview": "id,cement,slag,flyAsh,water,superPlasticizer,coarseAggregate,fineAggregate,age,concrete_strength\n0,540.0,0.0,0.0,162.0,2"
  },
  {
    "path": "tests/data/hdi.csv",
    "chars": 8274,
    "preview": "Population,Area (sq. mi.),Pop. Density ,GDP ($ per capita),Literacy (%),Infant mortality ,Development Index\n9944201,1284"
  },
  {
    "path": "tests/data/house_sales.csv",
    "chars": 9255,
    "preview": "saledate,MA,type,bedrooms\r\n30/09/2007,441854,house,2\r\n31/12/2007,441854,house,2\r\n31/03/2008,441854,house,2\r\n30/06/2008,4"
  },
  {
    "path": "tests/data/ionosphere.csv",
    "chars": 76601,
    "preview": "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c"
  },
  {
    "path": "tests/data/tripadvisor_binary_sample.csv",
    "chars": 308177,
    "preview": "Review,Label\n\"nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previo"
  },
  {
    "path": "tests/data/wine_reviews_binary_sample.csv",
    "chars": 157274,
    "preview": "country,description,price,province,region_1,region_2,label\nUS,\"Much like the regular bottling from 2012, this comes acro"
  },
  {
    "path": "tests/integration/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/integration/advanced/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/integration/advanced/test_array.py",
    "chars": 1681,
    "preview": "import unittest\nimport numpy as np\nimport pandas as pd\nfrom lightwood.api.types import ProblemDefinition\nfrom lightwood."
  },
  {
    "path": "tests/integration/advanced/test_custom_modules.py",
    "chars": 3051,
    "preview": "from lightwood.api.high_level import json_ai_from_problem, code_from_json_ai, predictor_from_code, load_custom_module\nfr"
  },
  {
    "path": "tests/integration/advanced/test_text_input.py",
    "chars": 1227,
    "preview": "from lightwood.api.types import ProblemDefinition\nfrom lightwood.api.high_level import predictor_from_problem\nimport pan"
  },
  {
    "path": "tests/integration/advanced/test_timeseries.py",
    "chars": 29897,
    "preview": "import random\nfrom datetime import datetime\nfrom datetime import timedelta\nimport unittest\nimport numpy as np\nimport pan"
  },
  {
    "path": "tests/integration/basic/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/integration/basic/notes.txt",
    "chars": 389,
    "preview": "\nNotes:\n-> Code generation templates + functions that generate code from templates + variables (look into templating lib"
  },
  {
    "path": "tests/integration/basic/test_airline.py",
    "chars": 897,
    "preview": "import unittest\nimport pandas as pd\nfrom sklearn.metrics import accuracy_score\nfrom tests.utils.timing import train_and_"
  },
  {
    "path": "tests/integration/basic/test_categorical.py",
    "chars": 3128,
    "preview": "import unittest\nimport numpy as np\nimport pandas as pd\nfrom sklearn.metrics import balanced_accuracy_score\nfrom lightwoo"
  },
  {
    "path": "tests/integration/basic/test_cleaner.py",
    "chars": 2961,
    "preview": "import unittest\nimport numpy as np\nimport pandas as pd\n\nfrom lightwood.api.types import ProblemDefinition\nfrom lightwood"
  },
  {
    "path": "tests/integration/basic/test_embedding.py",
    "chars": 1371,
    "preview": "import unittest\nimport pandas as pd\nfrom tests.utils.timing import train_and_check_time_aim\nfrom lightwood.api.types imp"
  },
  {
    "path": "tests/integration/basic/test_ensembles.py",
    "chars": 2726,
    "preview": "import unittest\nimport pandas as pd\nfrom sklearn.metrics import r2_score, accuracy_score\nfrom lightwood.api.high_level i"
  },
  {
    "path": "tests/integration/basic/test_jsonai.py",
    "chars": 2390,
    "preview": "import unittest\nimport pandas as pd\nfrom lightwood.api.types import ProblemDefinition\nfrom lightwood.api.high_level impo"
  },
  {
    "path": "tests/integration/basic/test_model_selection.py",
    "chars": 2988,
    "preview": "import unittest\nimport pandas as pd\n\nfrom lightwood.api.high_level import json_ai_from_problem\nfrom lightwood.api.types "
  },
  {
    "path": "tests/integration/basic/test_qclassic.py",
    "chars": 1641,
    "preview": "import unittest\nimport pandas as pd\nfrom sklearn.metrics import accuracy_score\nfrom lightwood.api.high_level import Prob"
  }
]

// ... and 48 more files (download for full content)
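The manifest above is a plain JSON array of `{"path", "chars", "preview"}` records, so it can be post-processed directly, for example to total up sizes or rank files by length before feeding them to an LLM. A minimal sketch (the entries below are abbreviated stand-ins for the real manifest, not actual repository data):

```python
import json

# A few abbreviated entries in the same {"path", "chars", "preview"} shape
# as the manifest above; a real run would load the full downloaded array.
manifest = json.loads("""
[
  {"path": "lightwood/mixer/neural.py", "chars": 15938, "preview": "import time..."},
  {"path": "lightwood/mixer/sktime.py", "chars": 17465, "preview": "import importlib..."},
  {"path": "tests/__init__.py", "chars": 0, "preview": ""}
]
""")

# Total size of the listed files, in characters.
total_chars = sum(entry["chars"] for entry in manifest)

# Largest files first -- useful when budgeting tokens for an AI tool.
largest = sorted(manifest, key=lambda e: e["chars"], reverse=True)

print(total_chars)         # 33403
print(largest[0]["path"])  # lightwood/mixer/sktime.py
```

The same loop works on the full 248-entry array from the downloaded `.txt`, since every entry carries the same three keys.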

About this extraction

This page contains the full source code of the mindsdb/lightwood GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 248 files (5.3 MB), approximately 1.4M tokens, and a symbol index with 861 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract, a free GitHub-repository-to-text converter for AI. Built by Nikandr Surkov.
